Supplementary Codes

Reproducible walk-through of codes and data for Buphamalai et.al.

true , true , true , true
2021-08-12

Introduction

This supplementary report is aimed to be a reproducible walk-through guide for figures and analyses complementing the manuscript: Buphamalai et.al., Network analysis reveals rare disease signatures across multiple levels of biological organization, submitted to Nature Communications. PLease find the following guidelines:

  1. The appearance of sections in this document is at the same order as in the manuscript, and can be navigated using the Table of Content (ToC) appeared on the top left corner of this document, with subsections corresponding to exact figures/statistics
  2. The corresponding code chunks to each figures/analyses are provided above the output, and are hidden by default. To expand each code chunk, click Show code.
  3. This report mainly contains visualization and post-processing of major analyses. Heavier computations were pre-computed, and corresponding R, sh, or py scripts required for each analysis are mentioned for each section. These files are available in ./source folder.
  4. The pre-computed results are saved in ./cache folder and can be downloaded from: link
  5. The corresponding Rmd files used to produce this report can be found in ./report/ folder. The report mainly focuses on reproducing results shown in the manuscripts. For explanation, discussion, and references, please refer to the main manuscript.

Note: genome-phenome data used in the analysis in 7.1_Prioritize_ID_RDconnect.Rmd is subject to controlled access by RD-connect.

Characterizing the network architectures across biological scales

Show code
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
library(pacman)
p_load(patchwork, igraph, tidyverse, cowplot, rmarkdown)

# compute different network properties
if(!file.exists("../cache/network_complementarity_topological.RDS")){
  # load required functions
source("../source/network_properties_analysis.R")

} else{
  print("Load precomputed data")
  g_prop_df <- readRDS("../cache/network_complementarity_topological.RDS")
}
[1] "Load precomputed data"
Show code
network_details <- read_tsv("../data/network_details.tsv", col_types = 'ccccc')

Network details (Suppl. Data 1)

45 Network layers from six major databases were constructed as detailed below:

Show code
paged_table(network_details %>% select(!type))

Topological complementarity (Fig. 1h)

In addition to the scale comprehensiveness, these networks are also topologically complementary. A number of key network properties including node and link coverage, modularity, assortativity, and social bias, have been compared and shown below.

Social bias: many networks were constructed based on curation from literatures. The social bias of a network is assessed by the Spearman’s correlation coefficient between the network degree of a gene and the number of publications mentioning the gene. The number of publications was queried using the INDRA python module (http://www.indra.bio, accessed on 12 April 2019)

Show code
g_prop_df %>% 
  mutate(type = ifelse(grepl("coex", network), "co-expression",network)) %>%
  group_by(type, property) %>%
  summarise(value = mean(value)) %>%
  paged_table()

Note that, for the co-expression layer, the values showed on the table above are averaged from all 38 tissue-specific networks.

The plot below summarises the table properties.

Show code
# create a list of  plots to patch together
plot = list()
for(prop in unique(g_prop_df$property)){
  plot[[prop]] = g_prop_df  %>% 
    arrange(group) %>% filter(property == prop) %>%
    ggplot( aes(x=group, y=as.numeric(value))) + 
    geom_segment( aes(x=group, xend=group, y=0, yend=value), color="grey80", size=1.5) +
    geom_violin(fill="#F8B100", alpha = 0.4, color = NA) +
    geom_point( aes(color=alphaval), size=4, alpha=0.6) +
    theme_light() + 
    coord_flip() +
    theme(
      panel.grid.major.y = element_blank(),
      panel.border = element_blank(),
      axis.ticks.y = element_blank(),
     # axis.text.y = element_blank(),
    ) +
    guides(color = F)+
    xlab("") +
    scale_color_manual(values = c("#F8B100",NA)) +
    ylab(prop)
  
  # scale y log for some properties (n edges)
  if(prop %in% c("Number of edges")){
    plot[[prop]] = plot[[prop]] + scale_y_log10()
  }
  
  # for the first plot, allows axis label
  if(!prop %in% c("Number of nodes")){
    plot[[prop]] = plot[[prop]] +  theme(axis.text.y = element_blank())
  }
  
}

plot_combine = plot$`Number of nodes` + plot$`Edge density` + plot$`Global clustering` + plot$Assortativity + plot$`Social bias` + plot_layout(nrow = 1)

# uncomment to save the plot as pdf
#ggsave("../Figs/network_properties_characterisation.pdf", plot_combine, height = 2.5, width = 9)

suppressWarnings(print(plot_combine))

The Social bias of the networks (Suppl. Fig. 1f)

The network similarity (Fig. 1f, g)

We quantified the similarity of a given pair of networks \(g_A \in G(V_A, E_A)\) and \(g_B \in G(V_A, E_A)\) using the edge overlap index: \[S_{AB}=\dfrac{|E_A \cap E_B|}{\text{min}(|E_A|,|E_B|)}\] We used a dissimilarity measure defined as \(d_{AB} = 1 - S_{AB}\) to construct a 2D map \(\mathbf{X} \subset \mathbb{R}^{2}\) that preserves network dissimilarities by employing Kruskal’s non-metric multidimensional scaling (R package MASS) 75. Finally, we compared the measured similarity of each network pair to random expectation: For each network, we performed 10 permutations of node indices, resulting in 100 permutations for a network pair which we used as random reference distribution to assess the measured overlap similarity. We then computed \(z\)-score and the corresponding empirical \(p\)-value. A network pair with \(p-\)value < 0.05 is considered significant.

The MDS plot derived from Jaccard and Overlap Similarity is as follows:

Show code
pacman::p_load(ggrepel, MASS)

# load the precomputed data
if(!file.exists("../cache/network_jaccard_overlap_similarity_df.RDS")){
  source("../source/compute_jaccard_similarity.R")
} else{
  print("load pre-computed network similarity data")
  network_sim_df <- readRDS("../cache/network_jaccard_overlap_similarity_df.RDS")
}
[1] "load pre-computed network similarity data"
Show code
# turn df to weight symmatrix matrix through graph
g_overlap <- graph_from_data_frame(network_sim_df[,c(1,2,4)] %>% 
                                     rename(., weight = overlapindex), directed = F)

sim_overlap <- get.adjacency(g_overlap, attr = "weight")

diag(sim_overlap) = 1

#change similarity to distance
dist_overlap = 1 - sim_overlap 

############
# MDS plot normal 

#### MDS plot for Kruskal
mds<- isoMDS(as.matrix(dist_overlap), k = 2)
initial  value 46.790384 
iter   5 value 32.191184
iter  10 value 23.308862
iter  15 value 21.770579
final  value 21.704111 
converged
Show code
# a data frame of MDS values
mds_df = data.frame(x = mds$points[,1], y = mds$points[,2], network = rownames(mds$points))

# add network metadata and node size
mds_df = mds_df %>%
  left_join(., g_prop_df %>% dplyr::filter(property=="Number of nodes") %>% 
              dplyr::select(network, value)) %>%
  left_join(., network_details) %>%
  dplyr::filter(!is.na(main_type)) %>%
  mutate(label = ifelse(!grepl("coex", network), subtype, "")
        # collabel = ifelse(!is.na(type), type, subtype)
         )

# plot the scatters of all networks
p <- mds_df %>% 
  ggplot() + 
  geom_point(aes(x, y, col = main_type, size = value), alpha = 0.5) + 
  geom_text_repel(aes(x, y, label = label)) + 
  theme_cowplot() +theme(
    axis.text.x=element_blank(),
    axis.ticks.x=element_blank(),
    axis.ticks.y=element_blank(),
    axis.text.y=element_blank()) + 
  xlab("MDS1") + ylab("MDS2") +
  scale_color_manual(values = c("#F8B100", "#005564"))+
  guides(col = F, size  = F)


p
Show code
#ggsave("../Figs/scatter_Network_complementarity_MDS_Overlap.pdf",plot = p, width = 4, height = 4)

Summary statistics of the network similarity

Overlap index:

Show code
network_sim_df %>% pull(overlapindex) %>% summary
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0001835 0.0121798 0.0331371 0.0419274 0.0543372 0.4899776 

Median overlap indices between co-expression networks are 0.0433192 and non co-expression networks are 0.0183954.

Show code
## Jaccard heatmap

## remove coex_core from the map
p1 = network_sim_df[,1:3] %>% dplyr::filter(!grepl("core", V1), !grepl("core", V2)) %>% 
  # heatmap plot
  ggplot() + geom_tile(aes(x = V1, y = V2, fill = jaccardIndex)) + scale_fill_distiller(direction = 1) + xlab("") +ylab("") + ggtitle("Jaccard similarity among all networks") + theme_minimal()+ theme(axis.text.x = element_text(angle = 90))

p1

Global similarity (Suppl. Fig. 1g)

We use core transcriptional modules to represent all of the co-expression network. The heatmap below shows the Jaccard and Overlap similarity.

Show code
## remove coex_core from the map
considered_networks = c("coex_core", "reactome_copathway", "ppi", "MP", "HP", "GOMF", "GOBP")

labels = c("co-expression", "co-pathway", "PPI", "MP", "HP", "GOMF", "GOBP ")


# Jaccard index
#######
p1 = network_sim_df[,1:3] %>% dplyr::filter(V1 %in% considered_networks, V2 %in% considered_networks ) %>% 
  # rescale factor
  mutate(V1 = factor(V1, levels = considered_networks, labels = labels),
         V2 = factor(V2, levels = considered_networks, labels = labels)) %>%
  # heatmap plot
  ggplot() + geom_tile(aes(x = V1, y = V2, fill = jaccardIndex)) + scale_fill_distiller(direction = 1) + xlab("") +ylab("") + ggtitle("Jaccard similarity") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

#ggsave("../Figs/heatmap_overlap_index_aggregated_jaccard.pdf", p1, width = 5, height = 4)

# Overlap index
#######
overlap_df <- network_sim_df[,c(1,2,4)] %>% dplyr::filter(V1 %in% considered_networks, V2 %in% considered_networks ) %>% 
  # rescale factor
  mutate(V1 = factor(V1, levels = considered_networks, labels = labels),
         V2 = factor(V2, levels = considered_networks, labels = labels))

  # heatmap plot
p2 = ggplot(overlap_df) + geom_tile(aes(x = V1, y = V2, fill = overlapindex)) + 
  scale_fill_distiller(direction = 1) + 
  xlab("") +ylab("") + 
  ggtitle("overlap similarity ") + 
  theme_cowplot() + 
  theme(axis.text.x = element_text(angle = 45, hjust=1), legend.position = "bottom")

#ggsave("../Figs/heatmap_overlap_index_aggregated_overlap.pdf", p2, width = 4, height = 5)

p1 + p2

Show code
sprintf("average overlap index for all networks are %f", mean(overlap_df$overlapindex))
[1] "average overlap index for all networks are 0.060363"

Network similairty randomization (Suupl. Fig. 1h)

For a given pair of network \(g_A \in G(V_A, E_A)\) and \(g_B \in G(V_A, E_A)\), we computed edge similarity through overlap index:

\[S_{AB}=\dfrac{|E_A \cap E_B|}{\text{min}(|E_A|,|E_B|)}\]

Foe each network, we performed 10 permutations of node indices, resulting in 100 permutations for a network pair, wehere we obtained the reference distribution for their similarity. We then computed \(z\)-score and the corresponding empirical \(p\)-value. A network pair with \(p-\)value < 0.05 is considered significant (../source/network_overlap_randomisation.R)

Show code
pacman::p_load(tidyverse, patchwork, ggrepel, cowplot)

# Perform the randomisation, or load from cahe (recommend)
source("../source/network_overlap_randomisation.R")
[1] "load the precomputed value from cache"
Show code
# network similarity values
network_sim_df <- readRDS("../cache/network_jaccard_overlap_similarity_df.RDS") %>% 
  dplyr::filter(!grepl("core", V1), !grepl("core" , V2))

# edge counts
ecounts <- readRDS("../cache/network_complementarity_topological.RDS")%>% 
  dplyr::filter(property=="Number of edges") %>% 
  pull(value, name = network) 

# compute minimum edge size for each pair
network_sim_df$min_ecount <- apply(network_sim_df, 1, function(x) min(ecounts[x[1]], ecounts[x[2]]))
Show code
# process jaccard and overlap index
jaccard_index <- lapply(randomisation_overlap_result, function(x) x$intersect/x$union)
overlap_index <- lapply(1:length(randomisation_overlap_result), function(x) 
  randomisation_overlap_result[[x]][['intersect']]/network_sim_df$min_ecount[x])

# compute mean and sd, zscore and pvalue
network_sim_df <- network_sim_df %>% 
  mutate(jaccard.mean = sapply(jaccard_index, mean),
         jaccard.sd = sapply(jaccard_index, sd),
   
         overlap.mean = sapply(overlap_index, mean),
         overlap.sd = sapply(overlap_index, sd),
         
         jaccard.zscore = (jaccardIndex-jaccard.mean)/jaccard.sd,
          jaccard.pval = pnorm(jaccard.zscore, lower.tail = F),
               
          overlap.zscore = (overlapindex-overlap.mean)/overlap.sd,
          overlap.pval = pnorm(overlap.zscore, lower.tail = F)
  )


# label p values into classes
network_sim_df <- network_sim_df %>% 
  # make p values in groups
  mutate(adjusted_overlapPval = p.adjust(overlap.pval, method = "BH"),
         overlap.pval_level = cut(adjusted_overlapPval, 
                                  breaks = rev(c(1,5e-2, 1e-3, 1e-4, 1e-5, 0)), 
                                  labels = rev(c("ns","*","**","***","****")), 
                                  include.lowest = T, ordered_result = T),
         # compute whether the pair are from both co-expression, or non co-expressions
         type1 = grepl("coex", V1),
         type2 = grepl("coex", V2),
         pair_name = paste(str_remove(V1,"coex_|reactome_"), 
                           str_remove(V2,"coex_|reactome_"), sep = " - "),
         # label only if overlap score higher than 0.2
         pair_label = ifelse(overlapindex > 0.2, pair_name, ""),
         type_pair = factor(type1+type2, levels = 0:2, labels = c("non.coex - non.coex",
                                                                  "coex - non.coex",
                                                                  "coex - coex"))) %>%
  mutate(overlap.pval_level = factor(overlap.pval_level, levels = rev(levels(overlap.pval_level)))) 


network_sim_df %>% count(overlap.pval_level) %>% paged_table()

Despite their wide range of similarity scores, we found that 955 out of 990 network pairs (96.5%) are significantly more similar than random expectation.

Show code
# stable plot results
p_scatter = ggplot(network_sim_df, aes(x = overlapindex, 
                                       y = log2(overlapindex/overlap.mean))) +
  geom_point(aes(col = overlap.pval_level)) + 
  scale_colour_viridis_d(direction = -1) + 
  theme_minimal() +
  xlab(expression(Similarity~(S[AB]))) + labs(title = "Network pair similarity", col = "significance") + 
  ylab(expression(log[2](S[AB]/mu[S[AB]]))) 

# plot by type 
p_scatter_by_type <- p_scatter + 
  facet_grid(. ~ type_pair) + 
  geom_text_repel(aes(label = pair_label)) + 
  theme_cowplot() +
  theme(legend.position = "bottom")


# p value plot by level
pval_lv_count_df <- network_sim_df %>%  count(overlap.pval_level, name = "count")

pval_lv_count_by_type_df  <- network_sim_df %>%  
  count(overlap.pval_level, type_pair, name = "count") %>%
  group_by(type_pair) %>%
  mutate(prop = count/sum(count)) 
  

p_count_by_type = ggplot(pval_lv_count_by_type_df, aes(x = overlap.pval_level, y = prop)) + 
  geom_col(aes(fill = overlap.pval_level)) + 
  scale_fill_viridis_d(direction = -1) + 
  xlab("Significance") + ylab("proportion") + guides(fill = F) +
  theme_minimal() + facet_grid(. ~ type_pair) + labs(title = "Network similarity significance level") +
  theme_cowplot()


p = p_scatter_by_type/p_count_by_type

#ggsave("../Figs/scatter_randomisation_network_similarity.pdf", p, width = 7, height = 6)

p

Interestingly, we also observed that networks on different scales (i.e. among non co-expression layers) are all significantly similar, showing that there are key edges being maintained across genotype to phenotype.

GTEx data characterisation (Suppl. Fig. )

Show code
library(patchwork)

if(!file.exists("../cache/coex_nets_node_and_edge_counts.RDS")){
  source("../source/coex_nets_node_and_edge_counts.R")
}

counts <- readRDS("../cache/coex_nets_node_and_edge_counts.RDS")

# Plotting
########

colors <- c('#7fc97f','#beaed4','#fdc086','#ffff99','#386cb0')

p_nodes <- ggplot(counts$node_counts, aes(x = Measure, y = count, fill = Measure)) + 
  geom_violin() + 
     geom_boxplot(width = 0.2, fill = "white") +
  guides(fill = F) + cowplot::theme_cowplot() + xlab("Criterion") + ylab("Count") + ggtitle("# Nodes") + scale_fill_manual(values = colors) 


p_edges <- ggplot(counts$edge_counts, aes(x = Measure, y = count, fill = Measure)) + 
  geom_violin() + 
   geom_boxplot(width = 0.2, fill = "white") +
  guides(fill = F) + cowplot::theme_cowplot() + xlab("Criterion") + ylab("Count") + ggtitle("# Edges") + scale_fill_manual(values = colors[4:5]) + scale_y_log10()


p_coex <- p_nodes +p_edges + plot_layout(widths = c(2, 1))

#ggsave("../Figs/number_of_nodes_and_edges_coexpression_at_each_stages.pdf", p_coex, width = 6, height = 3, scale = 1.5)


p_coex  

Additional network properties (PPI subsets, assorativity)

Properties and performance of PPI for large-scale and curated database

Show code
# read data
library(pacman)
p_load(cowplot, igraph)

#if(!file.exists("../cache/PPI_subset_and_CoexCore_properties.RDS")){
#  source("../source/compute_properties_PPIsubsets_and_CoexCore.R")
#}

#g_prop_Labelled_df <- readRDS("../cache/PPI_subset_and_CoexCore_properties.RDS")
#cached_assortativity <- readRDS("../cache/citation_assortativities_PPI.RDS")

#citation_assortativity <- cached_assortativity$citation_assortativity
#neighour_degree_avg <- cached_assortativity$neighour_degree_avg
#local_assortativity <- cached_assortativity$local_assortativity
#ppi_degree <- cached_assortativity$ppi_degree
Show code
p_load(ggforce, patchwork)

p_load(reshape2, patchwork, igraph)

source("../functions/readdata_functions.R")
g <- process_graph_data("../data/network_edgelist_additional//")
g$PPI_HIPPIE <- process_graph_data("../data/network_edgelists/ppi.tsv")[[1]]

# select  for plotting - those derived from PPI
g <- g[names(g)][grepl("PPI_HIPPIE|core", names(g))]


# code chunks from source/network_properties_analysis.R

all_nodes = sapply(g, function(x) V(x)$name) %>% unlist %>% unique 

# citation count (# PMID per genes, queried and processed by INDRA)
citation_count = read_csv("../data/all_pmids_counts.csv", col_names = c("gene", "count"), col_types = "ci")
citation_count = citation_count %>% arrange(-count) %>% mutate(rank = 1:nrow(.))

citation_correlation = function(graph){
  #' take the citation count and correlate it with the degree of each graph
  
  degree_g = degree(graph)
  degree_g = tibble(gene = names(degree_g), degree = degree_g) 
  degree_g = inner_join(degree_g, citation_count, by = "gene") %>% arrange(rank) 
  #  degree_g =degree_g %>% filter(degree > 0) %>% group_by(gr=cut(degree, breaks=c(1,5,10,50,100,500,1000,5000), right=FALSE))
  
  #corval = cor(degree_g$count, degree_g$degree, method = "spearman")
  return(degree_g)
}

degree_citation_df = lapply(g, citation_correlation)

n_nodes = sapply(g, vcount)
n_edges = sapply(g, ecount)
density = sapply(g, edge_density)
clustering = sapply(g, transitivity)
local_clustering = sapply(g, function(graph) transitivity(graph, type = "average"))
assortativity = sapply(g, function(graph) assortativity_degree(graph, directed = F))
social_bias = sapply(degree_citation_df, function(x) cor(x$count, x$degree, method = "spearman"))

# 3 - process the results and prepare for plotting

g_prop_df = tibble(network = names(n_nodes), n_nodes, n_edges, density, clustering, local_clustering, assortativity, social_bias)
g_prop_df = reshape2::melt(g_prop_df, variable.name = "property")


g_prop_df =g_prop_df %>%
  mutate(
    # rename properties for readability
    property = factor(property, labels = c("Number of nodes", "Number of edges", "Edge density", "Global clustering", "Avg local clustering", "Assortativity", "Social bias")),
    # groups to plot
    type = ifelse(grepl("core", network), "Co-expression", "PPI"), 
    main_type = type,
    subtype = sapply(network, function(x) strsplit(x ,"_")[[1]][2]), 
    group = subtype, 
    # alpha values (co-ex or not co-ex -- for aesthetic labelling)
    alphaval = 0
  )


# get the stats from the original values

g_prop_df_orig <- readRDS( "../cache/network_complementarity_topological.RDS")
g_prop_coex <- g_prop_df_orig %>% filter(type == "Co-expression") %>% select(-source)
g_prop_coex$group = "Tissue-specific"

g_prop_Labelled_df <- rbind(g_prop_df, g_prop_coex)

# make HIPPIE goes up the rank
g_prop_Labelled_df$group <- fct_relevel(g_prop_Labelled_df$group, "HIPPIE", "HIPPIELargeScale", "HIPPIECurated", "Tissue-specific")
g_prop_Labelled_df$group <- fct_relevel(g_prop_Labelled_df$group, rev)
g_prop_Labelled_df$group <- factor(g_prop_Labelled_df$group,
                                   levels = c("core", "Tissue-specific", "HIPPIECurated", "HIPPIELargeScale", "HIPPIE"), 
                                   labels = c("Core (multi-tissue)", "Tissue-specific", "Curated", "Large scale", "Full PPI"))


# create a list of  plots to patch together
plot_combine = list()
 for(type_network in unique(g_prop_Labelled_df$type)){
  plot = list()
    for(prop in unique(g_prop_Labelled_df$property)){
    plot[[prop]] = g_prop_Labelled_df  %>% 
      filter(type == type_network) %>%
      arrange(group) %>% filter(property == prop) %>%
      ggplot( aes(x=group, y=as.numeric(value))) + 
    #  facet_grid(type ~ ., scales = "free", space = "free") +
      geom_segment( aes(x=group, xend=group, y=0, yend=value), color="grey80", size=1.5) +
      geom_violin(fill="#F8B100", alpha = 0.4, color = NA) +
      geom_point( aes(color=alphaval), size=4, alpha=0.6) +
      geom_sina(aes(color='0'), size=4, alpha=0.6) + # add jitters
      geom_hline(yintercept=0, color='grey50', size=1,  linetype='dotted') +
      theme_light() + 
      coord_flip() +
      theme(
        panel.grid.major.y = element_blank(),
        panel.border = element_blank(),
        axis.ticks.y = element_blank(),
       # axis.text.y = element_blank(),
      ) +
      guides(color = F)+
      xlab("") +
      scale_color_manual(values = c("#F8B100",NA)) +
      ylab(prop)
    
    # scale y log for some properties (n edges)
    if(prop %in% c("Number of edges")){
      plot[[prop]] = plot[[prop]] + scale_y_log10()
    }
    
    # for the first plot, allows axis label
    if(!prop %in% c("Number of nodes")){
      plot[[prop]] = plot[[prop]] +  theme(axis.text.y = element_blank())
    }
    
  }
  
  plot_combine[[type_network]] = plot$`Number of nodes` + plot$`Edge density` + plot$`Global clustering` + plot$Assortativity + plot$`Social bias` + plot_layout(nrow = 1)

}


plot_combi <- plot_combine$PPI/plot_combine$`Co-expression`


# uncomment to save the plot as pdf
#ggsave("../Figs/network_properties_characterisation.pdf", plot_combine, height = 2.5, width = 9)

suppressWarnings(print(plot_combi))
Show code
#ggsave("../Figs/additional_network_properties.pdf", plot = plot_combi, height = 4, width = 10)

Average deegree of neighbours (\(k_{nn}\)) (Suppl. Fig. 3e)

Two methods additionally employed to test dissortativity and prove its derivatives from noth high-throughput and curated PPI. Implementation of average degree of neighbours to observe the pattern of connectivity as a function of node degrees.

Show code
ppi_degree <- degree(g$PPI_HIPPIE)
ppi_degree = ppi_degree[ppi_degree>0]

p_k = degree_distribution(g$PPI_HIPPIE)
p_k = p_k[-1]

# q_k =  remain degree distribution
q_k = (1:max(ppi_degree) +1)*c(p_k[-1],0)/mean(ppi_degree)
#q_k = ppi_degree-1

mean_q  = sum(1:max(ppi_degree)*q_k)
variance_q = sum((mean(ppi_degree)-1:max(ppi_degree))^2 * q_k)


neighour_degree_avg = knn(g$PPI_HIPPIE)$knn 
neighour_degree_avg = neighour_degree_avg[names(ppi_degree)]

local_assortativity = (ppi_degree-1)*ppi_degree*(neighour_degree_avg - 1 - mean_q)/(2*ecount(g$PPI_HIPPIE)*variance_q)


# Load bioplex subsets
bioplex_baits_293T <- read_tsv("../data/raw_data/Bioplex_293T-baits.tsv")
bioplex_baits_293T$cell = "293T"
bioplex_baits_HCT116 <- read_tsv("../data/raw_data/Bioplex_HCT116-baits.tsv")
bioplex_baits_HCT116$cell <- "HCT116"

bioplex_baits <- rbind(bioplex_baits_293T, bioplex_baits_HCT116)

## compute degree and assortativity
cuts <- c(0,1,2,5, 10,20, 50,100, 200, 500,1000,5000)
citation_assortativity = full_join(citation_count,
                                   tibble(gene = names(ppi_degree), 
                                          local_assortativity = local_assortativity,
                                          degree = ppi_degree,
                                          neighbour_degree = neighour_degree_avg,
                                          degree_binned = cut(degree, breaks = cuts, labels = cuts[-1]),
                                          label = ifelse(local_assortativity < -0.002, gene, ""))) %>%
  filter(!is.na(local_assortativity)) %>%
  mutate(inBioPlexbait = gene %in% bioplex_baits$`Bait Symbol`)


sumstat <- citation_assortativity %>% group_by(inBioPlexbait) %>% summarise(summary = list(summary(neighbour_degree)))
#sumstat[1,"summary"][[1]]
#sumstat[2,"summary"][[1]]

p_count_neighbour <- ggplot(citation_assortativity, aes(x = count, y = neighbour_degree)) + 
  geom_point(alpha = 0.1, col = "grey75") + theme_cowplot() + 
  geom_point(data = citation_assortativity %>% 
               group_by(count) %>% 
               summarise(mean = mean(neighbour_degree)), 
             aes(x = count, y = mean), col = "red", shape = 4)  + 
  scale_y_log10() + 
  scale_x_log10() +
  xlab("# PubMed Count") + 
  ylab(expression("<"~italic(k[nn])~"("~italic(k)~")>")) + 
  ggtitle("",#"PubMed count vs its PPI neighbours", 
          subtitle = paste0("Pearson's r:", round(cor(citation_assortativity %>% 
                                                        filter(!is.na(count), !is.na(neighbour_degree)) %>% pull(count), citation_assortativity %>% 
                                                        filter(!is.na(count), !is.na(neighbour_degree)) %>% pull(neighbour_degree)), 3)))



p_degree_neighbour <- ggplot(citation_assortativity, aes(x = degree, y = neighbour_degree)) + 
  geom_point(alpha = 0.1, col = "grey75") + 
  theme_cowplot() + 
  geom_point(data = citation_assortativity %>% 
               group_by(degree) %>% 
               summarise(mean = mean(neighbour_degree)), 
             aes(x = degree, y = mean), col = "red", shape = 4)  + 
  scale_y_log10() + 
  scale_x_log10() +
  xlab(expression(italic(k))) + ylab(expression("<"~italic(k[nn])~"("~italic(k)~")>")) + 
  ggtitle("",#"PPI degrees vs its neighbours", 
          subtitle = paste0("Pearson's r:", round(cor(ppi_degree, neighour_degree_avg), 3)))

p_assortativity <- p_count_neighbour + p_degree_neighbour
#ggsave("../Figs/local_assortativity_PPI.pdf", plot = p_assortativity, width = 8, height = 3)

p_assortativity

Local assortativity on BioPlex networks

Showing that dissortativity can be dervied from large-scale studies too due to bait selection.

Show code
p_local_assortativity <- ggplot(citation_assortativity, aes(x = degree, y = local_assortativity, col = inBioPlexbait))+ 
  geom_point(alpha = 0.4) + 
  scale_x_log10() + 
  theme_cowplot()  + 
  guides(col = F) + 
  xlab("Degree") + 
  ylab("Local assortartivity") + 
  ggrepel::geom_text_repel(aes(label = label)) + 
  scale_color_brewer(type = "qual", direction = -1) + 
  facet_grid(.~inBioPlexbait)

#ggsave("../Figs/local_assortativity_colored_by_bioplex.pdf", plot = p_local_assortativity, width = 4, height = 3)
p_local_assortativity

Show code
p_local_assortativity_bioplex <- ggplot(citation_assortativity,  aes(x = inBioPlexbait, y= neighbour_degree, fill = inBioPlexbait)) + 
  geom_boxplot()  + 
  theme_cowplot()  + 
  guides(col = F) + 
  scale_y_log10() + 
  ggpubr::stat_compare_means() + 
  theme_cowplot()  + 
  guides(fill = F) + 
  scale_fill_brewer(type = "qual", direction = -1) + 
  xlab("BioPlex baits") + ylab("Average neighbour degree")

#ggsave("../Figs/local_assortativity_colored_by_bioplex_box.pdf", plot = p_local_assortativity_bioplex, width = 4, height = 3)
p_local_assortativity_bioplex

Show code
# 

# Load bioplex subsets
bioplex_baits_293T <- read_tsv("../data/raw_data/Bioplex_293T-baits.tsv")
bioplex_baits_293T$cell = "293T"
bioplex_baits_HCT116 <- read_tsv("../data/raw_data/Bioplex_HCT116-baits.tsv")
bioplex_baits_HCT116$cell <- "HCT116"

g_bioplex_293T <- read_tsv("../data/raw_data/Bioplex_293T-interactions.tsv", col_types = "-c-c") %>% graph_from_data_frame(.,directed = "F")
g_bioplex_HCT116 <- read_tsv("../data/raw_data/Bioplex_HCT116-interactions.tsv", col_types = "-c-c") %>% graph_from_data_frame(.,directed = "F")


assortativity_nominal(g_bioplex_293T, types =  ifelse(V(g_bioplex_293T)$name %in% bioplex_baits_293T$`Bait Symbol`,1, 2), directed = F)
[1] -0.3033449
Show code
assortativity_nominal(g_bioplex_HCT116, types =  ifelse(V(g_bioplex_HCT116)$name %in% bioplex_baits_HCT116$`Bait Symbol`,1, 2), directed = F)
[1] -0.4904884
Show code
annotate_assortativity = function(graph){
  
    g_degree <- degree(graph)
    g_degree = g_degree[g_degree>0]
    
    p_k = degree_distribution(graph)
    p_k = p_k[-1]
    
    
    q_k = (1:max(g_degree) +1)*c(p_k[-1],0)/mean(g_degree)
    #q_k = ppi_degree-1
    
    mean_q  = sum(1:max(g_degree)*q_k)
    variance_q = sum((mean(g_degree)-1:max(g_degree))^2 * q_k)
    
    neighour_degree_avg = knn(graph)$knn 
    neighour_degree_avg = neighour_degree_avg[names(g_degree)]
    
    local_assortativity = (g_degree-1)*g_degree*(neighour_degree_avg - mean_q)/(2*ecount(graph)*variance_q)
    

    citation_assortativity = full_join(citation_count,
      tibble(gene = names(g_degree), 
             local_assortativity = local_assortativity,
             degree = g_degree,
             neighbour_degree = neighour_degree_avg,
             label = ifelse(local_assortativity < -0.002, gene, ""))) %>%
      filter(!is.na(local_assortativity)) 

  return(citation_assortativity)
}


citation_assortativity_293T <- annotate_assortativity(g_bioplex_293T)
citation_assortativity_293T <- citation_assortativity_293T %>%
  mutate(cell = "293T",
         inBioPlexbait = gene %in% bioplex_baits_293T$`Bait Symbol`)
citation_assortativity_HCT116 <- annotate_assortativity(g_bioplex_HCT116) %>%
  mutate(cell = "HCT116",
         inBioPlexbait = gene %in% bioplex_baits_HCT116$`Bait Symbol`)

citation_assortativity_bioplex = rbind(citation_assortativity_HCT116, citation_assortativity_293T) %>%
  mutate(local_assortativity_cut = cut(local_assortativity, breaks = c(-1,-0.004,-0.002,0,1)),
         inBioPlexbait = factor(inBioPlexbait, levels = c(F,T), labels = c("non bait", "baits")))

p_bioplex_cell_assortaticity <- ggplot(citation_assortativity_bioplex, aes(x = degree, y = local_assortativity, col = inBioPlexbait))+ geom_point() + scale_x_log10() + theme_cowplot()  + guides(col = F) + xlab("Degree") + ylab("Local assortartivity")  + facet_grid(cell~inBioPlexbait) + scale_color_brewer(type = "qual", direction = -1) 

#ggsave("../Figs/local_assortativity_colored_by_bioplex_bioplexNet.pdf", plot = p_bioplex_cell_assortaticity, width = 4, height = 3)

p_bioplex_cell_assortaticity

Disease associaton of PPI subsets

Show code
p_load(ggforestplot)
rare_genetic_result_folder = "../cache/output/Orphageneset_rare_additionalNetworks//"
rare_genetic_heatmap_file = "../cache/heatmap_network_disease_association_orphanets_genetic_rare_diseases_additionalNetworks.pdf"


# for printing (cm)
A4width = 21
A4height = 29.7


# defines colours used in the heatmap
cols = RColorBrewer::brewer.pal(5, "Blues")
cols = cols[2:5]

# load  network labels
network_info = read_tsv("../data/network_details.tsv")
source("../functions/process_LCC_result.R")


  ## process the results LCC significance
  result_df = readRDS(paste0(rare_genetic_result_folder, "LCC_and_distance_calculation_results.RDS"))
  
  processed_result_df = process_LCC_result(result_df)
  
  processed_result_df <- processed_result_df %>% select(-c(type, main_type, subtype, source)) 
  
  processed_result_df <- left_join(processed_result_df, g_prop_df %>% select(-c(property, value, alphaval)) %>% distinct())
  
  p <- processed_result_df  %>% dplyr::filter(LCC.signif != "none") %>%
    ggplot(aes(name, network)) + geom_tile(fill = "white") + #facet_grid(.~main_type, space = "free", scale = "free")  +## to get the rect filled
    # geom_point(aes(size = N_in_graph*1.7),  colour = LCC.signif)  +   ## geom_point for circle illusion
     geom_stripes(odd = "grey90", even = "#00000000") +
  #  theme_light() + 
   # theme_forest() +
    theme_minimal_hgrid() +
    theme(panel.spacing = unit(0.25, "lines"),
          axis.text.x = element_text(angle = 90, hjust = 1), 
          panel.grid.major.y = element_line(colour="grey", linetype="dotted")) +
      geom_point(aes(
      size = log10(N_in_graph),
      fill = LCC.signif,
      colour = LCC.signif), alpha = 0.7)  + 
    scale_color_manual(values = cols) +
    # scale_size(range = c(1, 10))+             ## to tune the size of circles
      coord_flip() +
    labs(x="", y="") + guides(size = F)
  
  
p

It shows that PPI HuRI alone is very sparse to detect disease module in the first place.

Literature bias analysis

Investigating gene properties derived from curated databases in terms of literature bias (Suppl. Fig. 4).

Collect gene features from four example datasets

  1. Popularity - via citation counts
  2. Number of HPO terms matching genes
  3. Number of associated GO terms (BP and MF)
  4. Number of associated reactome pathways
  5. Brain expression (as reference for non-curated data)
Show code
# Downloaded the precomputed gene features (count of annotated terms or expression level)
source("../functions/readdata_functions.R")
if(!file.exists("../cache/gene_features.tsv")){
  source("../source/collect_gene_features.R")
}

gene_feature_df <- read_tsv(file = "../cache/gene_features.tsv")

head(gene_feature_df)
# A tibble: 6 x 12
  gene  count PubMedRank AllBrainAvg FrontalCortex n_pathways BP_terms
  <chr> <dbl>      <dbl>       <dbl>         <dbl>      <dbl>    <dbl>
1 TP53   9274          1       4.37          3.84          67       54
2 TNF    5633          2       0.115         0.112         10       88
3 EGFR   5313          3       4.72          5.41          45       78
4 VEGFA  4415          4      20.5          19.5           14       69
5 IL6    4312          5       0.435         0.347         13       91
6 APOE   4245          6     984.          575.            17       67
# … with 5 more variables: MF_terms <dbl>, AllGO_terms <dbl>,
#   Informative_GOterms <dbl>, HP_terms <dbl>,
#   Informative_HP_terms <dbl>

Perform correlations among selected features

  1. Feature count - PubMed Count
  2. Feature count - Network degree
  3. Network degree - PubMed count (Literature bias)
Show code
pacman::p_load(hexbin, cowplot, ggstatsplot, patchwork)


gene_feature_cor_df <- gene_feature_df %>% 
  filter(!is.na(count)) %>%
 # select(-PubMedRank) %>%
  filter(!is.na(count)) %>%
  pivot_longer(cols = colnames(gene_feature_df)[-c(1,2,3)], names_to = "feature", values_to = "n") %>%
  group_by(feature) %>%
  mutate(feature_rank = rank(-n))

gene_feature_cor_df$feature <- factor(gene_feature_cor_df$feature, 
                                      levels = c("AllBrainAvg", "FrontalCortex", "n_pathways", "BP_terms", "MF_terms", "AllGO_terms", "Informative_GOterms", "HP_terms", "Informative_HP_terms"),
                                      labels = c("Expression (Brain avg)", "Expression (frontal cortex)", "Reactome pathways", "# GO terms (BP)", "# GO terms (MF)", "# GO terms", "#Informative GO terms", "# HP terms", "# Informative HP terms"))


# select particular networks to plot as examples
selected_features <- c("# GO terms (BP)", "# GO terms (MF)","# HP terms", "Reactome pathways", "Expression (Brain avg)")


gene_feature_cor_df_plot <- gene_feature_cor_df %>% 
  filter(feature %in% selected_features) %>%
  mutate(feature  = factor(feature, levels = selected_features))


# load required networks
g <- process_graph_data(paste0("../data/network_edgelists/", c("coex_BRO", "GOBP", "GOMF", "HP", "reactome_copathway"), ".tsv"))


degrees <- lapply(g, function(x) tibble(degree = degree(x), gene = names(degree(x))))
names(degrees) <- c("Expression (Brain avg)", "# GO terms (BP)", "# GO terms (MF)",  "# HP terms", "Reactome pathways")
degree_df <- bind_rows(degrees, .id = "feature")


gene_feature_cor_df_plot <- inner_join(gene_feature_cor_df_plot, degree_df) %>%
  filter(!is.na(feature))


# PLot 1: feature and PubNed
p_pubmed_feature <- ggplot(gene_feature_cor_df_plot, aes(y=count, x=n)) + 
  geom_hex(bins = 30) + 
  scale_fill_viridis_c() +
 # scale_fill_gradient(low = "white", high = "navy") +
  geom_smooth(method = 'lm', se = F, color = 'red', linetype = "dashed") +
  #ggrepel::geom_text_repel(aes(label = lab))+
  ggpubr::stat_cor(method = "spearman",  r.accuracy = 0.01, col = "white") +
  facet_grid(.~feature, scales = "free") +
  theme_cowplot() + 
  scale_x_log10() + 
  scale_y_log10() +
  ylab("# PubMed") +
  xlab("Feature count (Expression level | # terms)") +
  theme(
  panel.background = element_rect(fill = "#440154FF",
                                colour = "#440154FF",
                                size = 0.5, linetype = "solid"))


# PLot 2
p_feature_degree <- ggplot(gene_feature_cor_df_plot, aes(y=degree, x=n)) + 
  geom_hex(bins = 30) + 
  scale_fill_viridis_c() +
 # scale_fill_gradient(low = "white", high = "navy") +
  geom_smooth(method = 'lm', se = F, color = 'red', linetype = "dashed") +
  #ggrepel::geom_text_repel(aes(label = lab))+
  ggpubr::stat_cor(method = "spearman", r.accuracy = 0.01, col = "white") +
  facet_grid(.~feature, scales = "free") +
  theme_cowplot() + 
  scale_x_log10() + 
  scale_y_log10() +
  ylab("Network degree") +
  xlab("Feature count (Expression level | # terms)") +
  theme(
  panel.background = element_rect(fill = "#440154FF",
                                colour = "#440154FF",
                                size = 0.5, linetype = "solid"))

# PLot 3
p_social_bias_degree <- ggplot(gene_feature_cor_df_plot, aes(y=degree, x=count)) + 
  geom_hex(bins = 30) + 
  scale_fill_viridis_c() +
 # scale_fill_gradient(low = "white", high = "navy") +
  geom_smooth(method = 'lm', se = F, color = 'red', linetype = "dashed") +
  #ggrepel::geom_text_repel(aes(label = lab))+
  ggpubr::stat_cor(method = "spearman", r.accuracy = 0.01, col = "white") +
  facet_grid(.~feature, scales = "free") +
  theme_cowplot() + 
  scale_x_log10() + 
  scale_y_log10() +
  ylab("Network degree") +
  xlab("# PubMed") +
  theme(
  panel.background = element_rect(fill = "#440154FF",
                                colour = "#440154FF",
                                size = 0.5, linetype = "solid"))



p <- p_pubmed_feature/p_feature_degree/p_social_bias_degree

#ggsave("../Figs/hexbin_plot_network_degree_vs_pubmed_count_vs_feature_count_v2.pdf", p, width = 12, height = 8)

suppressMessages(p)

Co-expression network characterisation (Fig. 1e)

Core- and tissue-specific co-expression

We characterised our tissue-specific co-expression networks based on GTEx. Our hypothesis is that genes that are highly co-expression across all tissues are likely required for cellular developments and survival, and should show a strong correlation with of essentiality. In this analysis, we downloaded the list of human essential genes from the OGEE database (v2), also included in /data/OGEE_esential_genes_20190416.txt.

Show code
## coexpresssion - share edges
## goal: to observe whether the shared edges among co-expression networks are essential

# 0 - load required data
## load coexel sum
library(pacman)
p_load(tidyverse, cowplot, knitr)

coex_el_sum_grouped = readRDS("../cache/coexpression_edge_counts_by_group.RDS")

# create a vector of total probability for each class
coex_el_sum_score  = coex_el_sum_grouped %>% 
  ungroup() %>% 
  group_by(essential_edge_score) %>% 
  summarise(count = sum(n)) %>% pull(count, name = essential_edge_score)

coex_el_sum_grouped %>% 
  dplyr::rename(n_tissues = n_binned_relabel) %>%
  group_by(n_tissues) %>%
  summarise(n_edges = sum(n), percent =  sum(n)*100/sum(coex_el_sum_score)) %>%
  kable
n_tissues n_edges percent
0-5 12056143 91.8978690
5-10 747278 5.6961215
10-15 209979 1.6005635
15-20 70745 0.5392533
20-25 23372 0.1781529
25-30 7347 0.0560025
30-38 4203 0.0320373

The plot for higher essentiality aas number of genes increased is shown below.

Show code
bar_essential = ggplot(coex_el_sum_grouped) + 
  geom_bar(aes(x = n_binned_relabel, y = n, fill = score), stat = "identity",  position="fill") + 
  xlab("Number of tissues") + ylab("Edge proportion") + theme_cowplot() + 
  theme(legend.position = "bottom", axis.text.x = element_text(size = 10)) + 
  guides(fill = guide_legend(title = "Essential gene in edge")) + scale_fill_manual(values = c('grey60','#9ecae1','#3182bd'))

bar_essential
Show code
#ggsave("./Figs/essentiality_co-expression.pdf", bar_essential, width = 4, height = 4)
Show code
coex_el_sum = readRDS("../cache/coexpression_raw_edge_counts.RDS") %>% ungroup
coex_el_sum_by_tissue <- coex_el_sum %>% 
  count(n, essential_edge_score, name = "count")

Identifying cross-scale network signatures of rare diseases

The structure of Orphanet Rare Disease Ontology was queried and processed using R interface of the Ontology Lookup Service (https://lgatto.github.io/rols/index.html). A number of calculation per-computed for further analyses on this section was performed in source/Orphanet_annotate_genes_to_ancestors.R.

Individual gene-disease association (Suppl. Data 4)

Show code
# load direct gene association
orpha_gene_onset_df <- readRDS("../cache/orpha_gene_onset_df.RDS")

# disease gene association with roots
orphanet_gene_association <- read_tsv("../data/orphaNet_disease_gene_association_with_roots.tsv")

# disease gene association at group level
source("../functions/readdata_functions.R")
gene_disease_orpha = process_disease_genes_data("../data/table_disease_gene_assoc_orphanet_genetic.tsv", 1, 2000)
[1] "read 28 diseases, of total 3593 associated genes."
Show code
#source("../source/read_orphanet_gene_association_data.R")
gene_disease_orpha = gene_disease_orpha$disgene_df

# Modify and merge data
orpha_gene_display_df <- orphanet_gene_association %>%
  dplyr::filter(n_genes > 0) %>%
  mutate(ID = as.double(str_remove(orphaID, "Orphanet:"))) 

DT::datatable(orpha_gene_display_df[,c("ID", "label", "n_genes", "genes")] ,
                extensions = 'Buttons',
                options = list(dom = 'Blfrtip',
                               buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
                               lengthMenu = list(c(10,25,50,-1),
                                                 c(10,25,50,"All"))))

Rare disease gene association is scarced and needs to be analysed as a group (Fig. 2a)

Rare diseases are scarcely annotated, and most disease terms (2686 out of 3771) are only associated with one gene. Network-based measurements for individual diseases are unfeasible and grouping of the terms for higher level association are necessary.

Show code
# from orphanet_mapping_top_branch
gene_per_disease = orpha_gene_onset_df %>% 
  dplyr::filter(!is.na(gene)) %>% 
  count(orphaID)

gene_per_disease_count = gene_per_disease %>% 
  mutate(group = cut(n, breaks = c(0:10, 100), labels = c(1:10, "> 10"))) %>% 
  count(group)

p = ggplot(gene_per_disease_count, aes(x = group, y =n)) + geom_col() +
  ggtitle("Most orphanet diseases are immediately associated with one gene") +
  theme_minimal() + ylab("Number of diseases") + xlab("Number of genes per disease")#+ scale_y_log10()

plotly::ggplotly(p)
Show code
pacman::p_load(cowplot)
#gene_per_disease_group = gene_disease_orpha %>% 
 # mutate(group = cut(n_genes, breaks = c(seq(0,100,20), seq(200,1000, 200), Inf))) %>% count(group)

#ggplot(gene_per_disease_group, aes(x = group, y =n)) + geom_col() +
#  ggtitle("Most orphanet diseases are immediately associated with one gene") +
#  theme_minimal() + ylab("Number of diseases") + xlab("Number of genes per disease")#+ scale_y_log10()

# number shift
gene_per_disease_group = gene_disease_orpha %>% 
  mutate(disease = "Grouped", n = N) %>% 
  select(disease, n)

gene_per_disease = gene_per_disease %>% 
  mutate(disease = "Individual")

gene_per_disease_both = gene_per_disease %>% 
  select(disease, n) %>%
  bind_rows(., gene_per_disease_group) %>% 
  dplyr::filter(n>0) %>%
  # relevel factor
  mutate(disease = factor(disease, levels = c("Individual","Grouped")))

# add break values for plotting on log scale on x axis
breakvals = c(0:9, seq(10,90,10), seq(100,1000,100), 2000)

gene_per_disease_both_count = gene_per_disease_both %>% 
  mutate(group = cut(n, breaks = breakvals, include.lowest = T, labels = breakvals[-1])) %>%
  group_by(disease, group) %>% 
  summarise(n = n()) %>% 
  mutate(prop = n/sum(n)) %>% 
  full_join(., tibble(group = as_factor(breakvals[-1])))


gene_per_disease_both_count$disease[is.na(gene_per_disease_both_count$disease)] = "Grouped"
gene_per_disease_both_count$n[is.na(gene_per_disease_both_count$n)] = 0
# plot
#ggplot(gene_per_disease_both_count, aes(y = prop, x = group)) + geom_col() +  theme_minimal() +facet_grid(disease ~ .)

p = ggplot(gene_per_disease_both_count, aes(y = n, x = group)) + 
  geom_col() +  
  facet_grid(disease ~ ., scales = "free") +
  #scale_x_discrete(guide = guide_axis(check.overlap = TRUE)) +
  scale_x_discrete(breaks = c(1,10,100,1000))+
  geom_vline(xintercept = which(levels(gene_per_disease_both_count$group)==20), linetype = "dashed", col = "red") + theme_cowplot() + xlab("Genes per disease term") + ylab("Number of terms")

p
Show code
#ggsave("../Figs/orphanet_individual_vs_grouped_diseases_bar.pdf", p, width = 3*1.5, height = 2*1.5)

Based on the plot above, accumulating gene-disease association for descendant disease terms of ‘Rare genetic disease’ ( “Orphanet:98053”) resulting in physiologically distinct disease groups where the majority (26 out of 28) groups are associated with sufficient amount of genes for module detection (\(n=20\)).

Grouping rare disease genes (Suppl. Data 5)

The disease gene association can be found in data/table_disease_gene_assoc_orphanet_genetic.tsv. The summary of disease groups and all associated genes are shown below.

Show code
gene_disease_orpha_df <- gene_disease_orpha %>% select(name, N) %>% arrange(-N)
rmarkdown::paged_table(gene_disease_orpha_df) 

Consistent specificity across disease groups (Suppl. Fig. 2a)

Despite the wide range of associated genes, the average number of genes per disease term remains comparable across all disease groups, ensuring similar level of disease specificity across the disease domain.

Show code
root_and_gene_df <- orpha_gene_display_df %>%
  select(ID, genes, roots) %>%
  separate_rows(roots, sep = ";") 
  
n_term_per_group <- root_and_gene_df %>%
  distinct(ID, roots) %>%
  count(roots, name = "n_terms")

gene_disease_orpha_df$pass <- gene_disease_orpha_df$N >= 20

disease_characteristics <- left_join(gene_disease_orpha_df, n_term_per_group, by = c("name"="roots")) %>%
  mutate(genes_per_term = N/n_terms) %>%
  pivot_longer(!c(name, pass), names_to = "property", values_to = "count") %>%
  mutate(name = factor(name, levels = rev(gene_disease_orpha_df$name)),
         property = factor(property, levels = c("N", "n_terms", "genes_per_term"), 
                           labels = c("# genes", "# disease terms", "# genes/term"))) 



p <- ggplot(disease_characteristics, aes(x = name, y = count, fill = pass)) + 
  geom_col() + 
  facet_grid(.~property, scales = "free") +
  theme_cowplot() +
  coord_flip() +
  scale_fill_manual(values = c("#bdbdbd", "#1c9099")) +
  guides(fill = FALSE)

#ggsave("../Figs/barplot_disease_characteristics.pdf",p, width = 10, height = 5)

p

The summary statistics for each of these properties are as follows:

Show code
disease_characteristics %>% group_by(property) %>% summarise(mean = mean(count))
# A tibble: 3 x 2
  property          mean
* <fct>            <dbl>
1 # genes         339.  
2 # disease terms 262.  
3 # genes/term      1.74

Unique representation of the 26 disease groups (Suppl. Fig. 2b)

Even though some disease groups may contain some overlap terms, 90.5% of disease pairs are distinctive (Jaccard Index < 0.1) and therefore represent unique disease definition

Show code
removed_roots <- gene_disease_orpha %>% filter(N<20) %>% pull(name)

terms_per_root <- root_and_gene_df %>%
  filter(!roots %in% removed_roots) %>%
  group_by(roots) %>%
  summarise(IDs = list(unique(ID))) %>%
  filter(!is.na(roots)) %>%
  pull(IDs, name = roots)

genes_per_root <- gene_disease_orpha %>%
  pull(genes_all, name = name)

terms_pairwise_df <- combn(names(terms_per_root), 2) %>% t %>% as.tibble() 

jaccard <- function(x1, x2){
  length(intersect(x1, x2))/length(union(x1, x2))}

overlap <- function(x1, x2){
  length(intersect(x1, x2))/min(length(x1), length(x2))}

terms_pairwise_df$jaccard_terms <- apply(terms_pairwise_df, 1, function(x) jaccard(terms_per_root[[x[1]]], terms_per_root[[x[2]]]))
terms_pairwise_df$jaccard_genes <- apply(terms_pairwise_df, 1, function(x) jaccard(genes_per_root[[x[1]]], genes_per_root[[x[2]]]))
terms_pairwise_df$overlap_terms <- apply(terms_pairwise_df, 1, function(x) overlap(terms_per_root[[x[1]]], terms_per_root[[x[2]]]))
terms_pairwise_df$overlap_genes <- apply(terms_pairwise_df, 1, function(x) overlap(genes_per_root[[x[1]]], genes_per_root[[x[2]]]))

terms_pairwise_df <- terms_pairwise_df %>%
  mutate(label = ifelse(jaccard_terms>0.2, str_remove_all(paste(V1, V2,sep = " & "), "Rare |genetic | disorder| diseases| disease| during embryogenesis|inborn errors of |and obstetrical"),""))

p = ggplot(terms_pairwise_df, aes(x=jaccard_terms, y=jaccard_genes, label = label)) + 
  geom_point(aes(col = jaccard_terms), alpha = 0.7) +
  scale_color_viridis_c() +
  theme_cowplot() +
  xlab("Jaccard (disease terms)") +
  ylab("Jaccard (disease genes)") +
  ggrepel::geom_text_repel() + guides(col = F)

#ggsave("../Figs/scatter_disease_similarity_jaccard.pdf", p, width = 5, height = 5)

p

Omitted disease groups

Omitting disease groups with fewer than 20 associated genes, the number of terms (out of 3771) and genes excluded from further analyses include:

Show code
# table for removed terms
removed_disease_terms <- orphanet_gene_association %>%
  mutate(n_roots = str_count(roots, ";")+1,
         removed_roots = grepl(removed_roots[1], roots) + grepl(removed_roots[2], roots),
         remained_roots = n_roots - removed_roots) %>%
  filter(remained_roots == 0) %>%
  select(orphaID, label, genes, roots)

removed_genes <- sapply(removed_disease_terms$genes, function(x) str_split(x, ";")) %>% unlist %>% unique()

knitr::kable(removed_disease_terms)
orphaID label genes roots
Orphanet:199241 Pulmonary capillary hemangiomatosis EIF2AK4 Rare genetic respiratory disease
Orphanet:210122 Congenital alveolar capillary dysplasia FOXF1 Rare genetic respiratory disease
Orphanet:217566 Chronic respiratory distress with surfactant metabolism deficiency SFTPC Rare genetic respiratory disease
Orphanet:217563 Neonatal acute respiratory distress due to SP-B deficiency SFTPB Rare genetic respiratory disease
Orphanet:440402 Interstitial lung disease due to ABCA3 deficiency ABCA3 Rare genetic respiratory disease
Orphanet:440392 Interstitial lung disease due to SP-C deficiency SFTPC Rare genetic respiratory disease
Orphanet:444092 Autoimmune interstitial lung disease-arthritis syndrome COPA Rare genetic respiratory disease
Orphanet:31837 Pulmonary venoocclusive disease BMPR2;EIF2AK4 Rare genetic respiratory disease
Orphanet:100051 Hereditary angioedema type 2 SERPING1 Serpinopathy
Orphanet:100050 Hereditary angioedema type 1 SERPING1 Serpinopathy

There are 10 disease terms whose associated genes are not associated with any other disease groups, and hence these 8 genes are omitted.

Tree map for ontology representation of rare diseases (Fig. 2b)

Show code
# download the association
library(RColorBrewer)

orphanet_gene_association_unique_root <- orphanet_gene_association %>%
  separate_rows(., roots, sep = ";", convert = T) %>% dplyr::filter(!is.na(roots)) %>% 
  mutate(roots = as.factor(roots))

## take a function to allow modifying alpha values
# Add an alpha value to a colour
add.alpha <- function(col=NULL, alpha=1){
  if(missing(col))
    stop("Please provide a vector of colours.")
  apply(sapply(col, col2rgb)/255, 2, 
                     function(x) 
                       rgb(x[1], x[2], x[3], alpha=alpha))  
}

# define colours for all disease groups
mycolors <- colorRampPalette(brewer.pal(8, "Set2"))(nrow(gene_disease_orpha))

# add id 1-28
gene_disease_orpha$id = 1:nrow(gene_disease_orpha)

# add colour with corresponding alpha values to each disease group
gene_disease_orpha_mod <- gene_disease_orpha %>% 
  rowwise() %>%
  mutate(col =mycolors[id])
  #mutate(col = add.alpha(mycolors[id], N/max(gene_disease_orpha$N)))

gene_disease_orpha_mod$root = "Orphanet"

# load voronoi treemap package
if(!"voronoiTreemap" %in% rownames(installed.packages())){
  pacman::p_load_gh("https://github.com/uRosConf/voronoiTreemap")
} else{
  library(voronoiTreemap)
}


gene_disease_orpha_mod$plotlab = str_remove_all( gene_disease_orpha_mod$name, "Rare |genetic | disease| syndrome| disorder")
onto_json <- vt_export_json(vt_input_from_df(gene_disease_orpha_mod, hierachyVar0 = "root", hierachyVar1 = "name", hierachyVar2 = "name", colorVar = "col", weightVar = "N", labelVar = "plotlab"))

vt_d3(onto_json)

Disease-network landscape (Fig. 2c,d)

The node2vec embedding was performed to allow visualization of large networks in small space.

Show code
# only run this chuck to recompute all the coordinate values

#embed_result_dir = "./embedded_results/"
embed_result_dir = "../data/network_node2vec_results//"  #only the new coexpression networks
embed_files = list.files(embed_result_dir, recursive = T)

# dim reduction
library(uwot)
library(Rtsne)

for(i in embed_files){
  print(i)
  df = read_delim(paste0(embed_result_dir, i), delim = " ", skip = 1, col_names = F) 
  df = df[apply(df, 1, function(x) !any(is.na(x))),]
  df = column_to_rownames(df, var = "X1")
  
  print("UMAP")
  umap_results = uwot::umap(df, n_neighbors = 15)
  print("PCA")
  pca_results = pcaMethods::pca(df)
  print("tsne")
  tsne_results = Rtsne::Rtsne(X = df, dims=2)
  
  umap_results_df = tibble(X = umap_results[,1], Y = umap_results[,2])
  tsne_results_df = tibble(X = tsne_results$Y[,1], Y = tsne_results$Y[,2])
  pca_results_df = tibble(X = pca_results@scores[,1], Y = pca_results@scores[,2])
  
  umap_results_df$name = tsne_results_df$name = pca_results_df$name = rownames(df)
  
  write_tsv(umap_results_df, paste("../cache/embedded_results_2D/", i, "umap.tsv", sep = "_"), col_names = T)
  write_tsv(tsne_results_df, paste("../cache/embedded_results_2D/", i, "tsne.tsv", sep = "_"), col_names = T)
  write_tsv(pca_results_df, paste("../cache/embedded_results_2D/", i, "pca.tsv", sep = "_"), col_names = T)
}
Show code
for(d in alldiseases){
  disease = orpha$disgene_list[[d]]
  
  for(n in embed_files){
    network_name = str_replace(n, "coex/", "")
    
    # load the node2vec embedded results
    tsne_results_df = read_tsv(paste("../cache//embedded_results_2D/", network_name, "tsne.tsv", sep = "_"), col_types = 'ddc') %>% 
      mutate(indisease = name %in% disease)

    n <- str_replace(n, "/", "_") # replace / by _ for labelling
    
    # create directory to store results
    dir.create(file.path(paste0(embedding_fig_dir, "/by_disease"),  d), showWarnings = FALSE)
    dir.create(file.path(paste0(embedding_fig_dir, "/by_network"),  n), showWarnings = FALSE)
 
    
    
    p <- ggplot(tsne_results_df)  + 
      stat_density_2d(aes(X,Y,  fill = stat(level)), geom = "polygon") + 
      theme(panel.background = element_rect(fill = '#0f2030'),
            axis.line=element_blank(),axis.text.x=element_blank(),
            axis.text.y=element_blank(),axis.ticks=element_blank(),
            axis.title.x=element_blank(),
            axis.title.y=element_blank(),legend.position="none",
            panel.border=element_blank(),panel.grid.major=element_blank(),
            panel.grid.minor=element_blank(),plot.background=element_blank())+ 
      geom_point(aes(X,Y, alpha = indisease), col = "white", size = 1) +
      scale_alpha_discrete(range = c(0, 0.5)) + guides(fill = FALSE, alpha = FALSE, col = FALSE)+
      ggtitle(paste0(d,": ",n))
    
    ggsave(filename = sprintf("%s/%s/%s/%s.pdf", embedding_fig_dir, "by_disease", d, n), plot = p, device = "pdf", width = 5, height = 5)
    ggsave(filename = sprintf("%s/%s/%s/%s.pdf", embedding_fig_dir, "by_network", n, d), plot = p, device = "pdf", width = 5, height = 5)
           
  }
}
Show code
# only run this chunk to replot
source("../functions/readdata_functions.R")

# folder to add figures
embedding_fig_dir = "../Figs/embedding/"

orpha = process_disease_genes_data("../data/table_disease_gene_assoc_orphanet_genetic.tsv", 20, 2000)
[1] "read 26 diseases, of total 3586 associated genes."
Show code
alldiseases = names(orpha$disgene_list)

tsne_results_df <- read_tsv("../cache//embedded_results_2D/_HP_tsne.tsv", col_types = 'ddc') 

selected_diseases <- c(
"Rare genetic immune disease", 
"Rare genetic cardiac disease", 
"Rare genetic renal disease",
"Rare genetic bone disease", 
"Rare genetic hematologic disease", 
"Rare genetic neurological disorder"
)
Show code
tsne_results_df <- read_tsv("../cache//embedded_results_2D/_HP_tsne.tsv", col_types = 'ddc') 

selected_diseases <- c(
"Rare genetic immune disease", 
"Rare genetic cardiac disease", 
"Rare genetic renal disease",
"Rare genetic bone disease", 
"Rare genetic hematologic disease", 
"Rare genetic neurological disorder"
)


embed_files_selected <- "HP"
embedding_fig_dir = "../Figs/embedding/"

embedplot <- list()

for(d in selected_diseases){
  disease = orpha$disgene_list[[d]]
  
  # create directory to store results
  dir.create(file.path(embedding_fig_dir, d), showWarnings = FALSE)
 
  embedplot[[d]] <- list()
  
  for(n in embed_files_selected){
    network_name = str_replace(n, "coex/", "")
    
    # load the node2vec embedded results
    tsne_results_df = read_tsv(paste("../cache//embedded_results_2D/", network_name, "tsne.tsv", sep = "_"), col_types = 'ddc') %>% 
      mutate(indisease = name %in% disease)

    n <- str_replace(n, "/", "_") # replace / by _ for labelling
    
    embedplot[[d]][[n]] <- ggplot(tsne_results_df)  + 
      stat_density_2d(aes(X,Y,  fill = stat(level)), geom = "polygon") + 
      theme(panel.background = element_rect(fill = '#0f2030'),
            axis.line=element_blank(),axis.text.x=element_blank(),
            axis.text.y=element_blank(),axis.ticks=element_blank(),
            axis.title.x=element_blank(),
            axis.title.y=element_blank(),legend.position="none",
            panel.border=element_blank(),panel.grid.major=element_blank(),
            panel.grid.minor=element_blank(),plot.background=element_blank())+ 
      geom_point(aes(X,Y, alpha = indisease), col = "white", size = 1) +
      scale_alpha_discrete(range = c(0, 0.5)) + guides(fill = FALSE, alpha = FALSE, col = FALSE)+
      ggtitle(paste0(d,": ",n))
           
  }
}

embedplot <- lapply(embedplot, function(x) x[[1]])
(embedplot[[1]] + embedplot[[2]] + embedplot[[3]])/(embedplot[[4]] + embedplot[[5]] + embedplot[[6]])

Quantifying network modularity of rare diseases

The computation was performed on bash script ./source/compute_localisation_analysis_allnetworks_orphanet_rare.sh and the saved results are shown below.

Show code
# library and configurations

pacman::p_load(tidyverse, ggrepel, knitr, cowplot)
pacman::p_load_gh("nightingalehealth/ggforestplot")


# load  network labels
network_info = read_tsv("../data/network_details.tsv")
source("../functions/process_LCC_result.R")

# defines colours used in the heatmap
cols = RColorBrewer::brewer.pal(5, "Blues")
cols = cols[2:5]


## a function for heatmap plotting
LCC_heatmap_plot = function(result_folder, heatmap_file){
  #' @input result_folder: folder where LCC files were saved
  #' @input heatmap_file: file where the heatmap is to be saved
  
  ## process the results LCC significance
  result_df = readRDS(paste0(result_folder, "LCC_and_distance_calculation_results.RDS"))
  
  processed_result_df = process_LCC_result(result_df)
  
  processed_result_df <- processed_result_df %>% mutate(LCC.signif = factor(LCC.signif, levels = rev(levels(LCC.signif))))
  
}

Heatmap for disease-network relevance (Fig. 3)

Show code
source("../source/cache_all_networks_and_disease_genes.R")

#rare_genetic_heatmap_file = "../cache/heatmap_network_disease_association_orphanets_genetic_rare_diseases.pdf"

heatmap_file <- "../Figs/heatmap_disease_network_specificity.pdf"

processed_result_df_plot <- processed_result_df  %>% 
  filter(LCC.signif != "none") %>%
  mutate(name_short = fct_recode(name, `Rare genetic gynecological diseases`="Rare genetic gynecological and obstetrical diseases", 
                           `Rare genetic developmental defect..`="Rare genetic developmental defect during embryogenesis",
                           `Rare genetic rheumatologic disease`="Rare genetic systemic or rheumatologic disease"),
         subtype = fct_recode(subtype, `Core (pan-tissue)` = "Core module"),
         LCC.signif = factor(LCC.signif, levels = rev(levels(LCC.signif))))

  

p <- ggplot(processed_result_df_plot, aes(name_short, subtype)) + geom_tile(fill = "white") + facet_grid(. ~
                                                                        main_type, space = "free", scale = "free")  + ## to get the rect filled
  # geom_point(aes(size = N_in_graph*1.7),  colour = LCC.signif)  +   ## geom_point for circle illusion
  #geom_stripes(odd = "grey92", even = "#00000000") +
    theme_light() +
  # theme_forest() +
  # theme_minimal_hgrid() +
  theme(
    panel.spacing = unit(0.25, "lines"),
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
    panel.grid.major.y = element_line(colour = "grey", linetype =
                                        "dotted")
  ) +
  geom_point(aes(
    size = log10(N_in_graph),
    fill = LCC.signif,
    colour = LCC.signif
  ), alpha = 0.7)  +
  scale_color_manual(values = (cols)) +
  # scale_size(range = c(1, 10))+             ## to tune the size of circles
  coord_flip() +
  labs(x = "", y = "") + guides(size = F, fill = F)

# for printing (cm)
A4width = 21
A4height = 29.7
  
ggsave(heatmap_file, plot = p, width = 0.4*A4width, height = 0.175*A4height, scale = 1.5)

p
Show code
#knitr::include_graphics(rare_genetic_heatmap_file)

bar plot for the number of diseases/tissues

Show code
# number of significant diseases
library(scales)

ntissue_per_disease = processed_result_df_plot %>% 
  dplyr::filter(LCC.signif!="none") %>% 
  count(name_short, LCC.signif) #%>% mutate(name = factor(name, levels = levels_name))

p = ggplot(ntissue_per_disease, aes(x = name_short, y = n, fill = LCC.signif)) + 
  geom_bar(stat = "identity") +  
  scale_fill_manual(values = (cols)) +
  #theme_light()+
  theme_minimal_vgrid()  + 
  ylab("# Networks") + 
  coord_flip() + 
  scale_y_continuous(breaks= scales::pretty_breaks())

P_without_label <- p + theme(axis.text.y = element_blank()) + guides(fill = F)+xlab("")+scale_y_continuous(breaks = c(0,10,20))

#ggsave("../Figs/barplot_ntissue_per_disease_for_heatmap_updated.pdf",P_without_label,  width = 0.125*0.5*A4width, height = 0.2*A4height, scale = 1.1)

p

Show code
# number of significant diseases
ndisease_per_tissue = processed_result_df_plot %>% 
  count(network, LCC.signif, main_type, subtype) %>% 
  mutate(network = fct_reorder(network, n, sum, .desc = T),
         subtype = fct_reorder(subtype, n, sum, .desc = T)) 


p = ggplot(ndisease_per_tissue, aes(x = subtype, y = n, fill = LCC.signif)) + geom_bar(stat = "identity")  +  scale_fill_manual(values = cols) + theme_minimal_hgrid() + #theme_light() +
  ylab("# Disease groups") + scale_y_continuous(breaks= scales::pretty_breaks())  + facet_grid(.~main_type, space = "free", scale = "free") 

P_without_label <- p + theme(axis.text.x = element_blank())
ggsave("../Figs/barplot_ndisease_per_tissue_for_heatmap_updated.pdf",P_without_label, width = 1.1*0.5*A4width, height = 0.35*1.1*0.175*A4height)

p + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Most significant layers per disease group

Show code
most_signif_per_disease_group <- processed_result_df %>% group_by(name) %>% mutate(rank = rank(-LCC.zscore)) %>% filter(rank ==1)
most_signif_per_disease_group$type[most_signif_per_disease_group$network=="ppi"] = "PPI"
most_signif_per_disease_group$type[most_signif_per_disease_group$network=="GOBP"] = "GO"
most_signif_per_disease_group
# A tibble: 26 x 16
# Groups:   name [26]
   name         network N_in_graph LCC.size LCC.mean LCC.sd LCC.zscore
   <fct>        <chr>        <dbl>    <dbl>    <dbl>  <dbl>      <dbl>
 1 Rare geneti… GOBP           141       29     7.60  2.31        9.27
 2 Rare inborn… GOBP           529      207    34.0  10.4        16.7 
 3 Rare geneti… HP              21       19     2.78  1.37       11.8 
 4 Laminopathy  HP              23       22     3.15  1.49       12.7 
 5 Rare geneti… MP             295      117    21.4   5.56       17.2 
 6 Rare geneti… MP              31       13     2.06  0.910      12.0 
 7 Rare geneti… MP              90       23     5.20  2.26        7.89
 8 Rare geneti… MP             115       33     6.69  2.83        9.29
 9 Rare geneti… MP             138       39     8.38  3.18        9.63
10 Ciliopathy   MP              95       55     5.38  2.24       22.2 
# … with 16 more rows, and 9 more variables: LCC.pval <dbl>,
#   correctedPval <dbl>, LCC.signif <ord>, minusLogpval <dbl>,
#   type <chr>, main_type <chr>, subtype <fct>, source <chr>,
#   rank <dbl>
Show code
p <- ggplot(most_signif_per_disease_group, aes(x=subtype)) + 
  geom_bar() + 
  facet_grid(.~type, space = "free", scales = "free") + 
  theme_cowplot()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
  

ggsave("../Figs/barplot_most_significant_networks.pdf",p, height = 4, width = 5)

Using differential modularity to contextualize rare disease gene clusters

A case study of microscopic analysis using our multiple networks framework. We explore the differential modularity of gastroenterological diseases as an example.

Show code
# load cached networks and diseases and LCC results
source("../source/cache_all_networks_and_disease_genes.R")

pacman::p_load(treemap, igraph, tidygraph, ggraph)

disease_of_interest <- "Rare genetic gastroenterological disease" 


# unique disease set under the term
individual_gene_set <- individual_diseases_genes %>% dplyr::filter(grepl(disease_of_interest, roots))

# all genes associated
genes_set <- rare_genetic_diseases_genes$disgene_list[[disease_of_interest]]

# list of all significant networks 
signif_nets <- processed_result_df %>% dplyr::filter(name == disease_of_interest, LCC.signif != "none") %>% arrange(desc(LCC.zscore)) %>% pull(network)

# degrees each gene in each network
degree_diseasegenes <- sapply(g[signif_nets], function(x) degree(x, v = intersect(V(x)$name, genes_set)))


# igraph object of significant networks for the disease group
disease_graph <- lapply(signif_nets, function(x) g[[x]]- setdiff(V(g[[x]])$name, genes_set))
names(disease_graph) <- signif_nets

Individual disease gene association

There are rnrow(individual_gene_set)` individual disease terms under Rare genetic gastroenterological disease, involving 140 genes.

Show code
# print the paged table
individual_gene_set %>% 
  select(label, genes, n_genes) %>% 
  arrange(-n_genes) %>%
  rmarkdown::paged_table()

Treemap association (Fig. 4)

The tree map for descendants of Rare genetic gastroenterological disease are as follows (limited to terms with at least two associated genes).

Show code
# set a threshold for minimal amount of genes required to include in the plotting
threshold = 2

individual_gene_set_wrap <- individual_gene_set %>% dplyr::filter(n_genes >= threshold) %>% select(label, n_genes)

n_removed <- sum(individual_gene_set$n_genes<=threshold)
#individual_gene_set_wrap <- rbind(individual_gene_set_wrap, data.frame(label = sprintf("Other (%i diseases with one or two associated genes)", n_removed), n_genes = n_removed))


# Create data
{group <- ifelse(individual_gene_set_wrap$n_genes > threshold+2, 
                individual_gene_set_wrap$label, str_trunc(individual_gene_set_wrap$label, 50, side = "right"))

value <- individual_gene_set_wrap$n_genes
data <- data.frame(group,value)
 

treemap(data,
        palette = "Set3",   
        title = disease_of_interest,
            index="group",
            vSize="value",
            type="index"
            )
}
Show code
# treemap - uncomment to plot
#pdf("../Figs/treemap_gastroenterological.pdf", width = 5, height = 4)

The module significance computation shows the following networks being significant:

Show code
processed_result_df %>% 
  dplyr::filter(name == disease_of_interest, LCC.signif != "none") %>% 
  arrange(desc(LCC.zscore)) %>%
  select(subtype, main_type, N_in_graph, LCC.size, LCC.mean, LCC.sd, LCC.zscore, correctedPval) %>%
  knitr::kable()
subtype main_type N_in_graph LCC.size LCC.mean LCC.sd LCC.zscore correctedPval
Protein-protein interaction Other 139 77 12.844 8.449086 7.593248 0.0000000
Human Phenotype Other 120 102 39.789 16.155292 3.850812 0.0004798
Esophagus mucosa Co-expression 99 50 19.179 9.010002 3.420754 0.0021025
Co-Pathway Membership Other 44 11 4.622 2.693022 2.368343 0.0353649
Co-essentiality Other 30 7 3.465 1.570108 2.251437 0.0458654

Differential modularity (Fig. 4)

The network differential modularity can be visualised as follows:

Show code
    # convert the igraph object into df for ggraph visualisation
disease_graph_df <- lapply(disease_graph, as_data_frame) %>% 
  bind_rows(., .id = "network") %>% 
  mutate(network = factor(network, levels = signif_nets))
    
    
disease_tidygraph <- as_tbl_graph(disease_graph_df) %>% 
  mutate(degree = centrality_degree(mode = 'all'))
    
tidygraph_ggrpah <- ggraph(disease_tidygraph, layout = 'kk')

 p <- tidygraph_ggrpah + 
          geom_edge_density(aes(fill = network),
                            show.legend = FALSE) + 
          geom_edge_fan(aes(alpha = stat(index), 
                            col = network), 
                          show.legend = FALSE) + 
          geom_node_point(col = "white",
                          alpha = 0.25,
                           show.legend  = F, 
                           size = 1) + 
          facet_edges(~factor(network, levels = signif_nets)) + 
          # black theme
            theme_graph(fg_text_colour = 'white', 
                      background = "black",
                      text_colour = "white") 
 
# uncomment when need to replot and save
#ggsave2(sprintf("../Figs/networks_%s.pdf", disease_of_interest), plot = p, width = 10, height = 6)
#ggsave2(sprintf("../Figs/networks_%s_separate_dimmed.png", disease_of_interest), plot = p, width = 10, height = 6) 

p

Modularity as quantification of pathobiological relevance

Note: this was the old version and the indexed one should be replaced by the new one.

Show code
pacman::p_load(patchwork, cowplot, ggstatsplot)
knitr::opts_chunk$set(echo=FALSE)

if(!file.exists("../cache/fold_cv_processed_results_revisioned_processed.RDS")){
  source("../source/process_10fold_retrieval_for_revision.R")
}
AUC_folds_summary_updated_plot <- readRDS("../cache/fold_cv_processed_results_revisioned_processed.RDS")

ROC-AUC comparison among all diseases and and network selection (Fig. 5b)

Show code
palettes <- c('#66c2a5','#8da0cb',"#fec44f" , '#fc8d62')

p <- ggplot(AUC_folds_summary_updated_plot,aes(x=name, fill = label, group = label)) +
   geom_ribbon(aes(ymin = Q25, ymax = Q75), alpha = 0.25) +
  geom_line(aes(y = med, col = label), linetype = 2) +
  coord_flip() + xlab("disease") + ylab("fold AUC") + theme_minimal_hgrid() + theme(legend.position = "bottom") +
  ggtitle("ROC-AUC comparison", subtitle = "area spans 25, 50 and 75% quantile") +
  #scale_fill_brewer(type = "qual") + scale_color_brewer(type = "qual")
  scale_fill_manual(values=palettes) + scale_color_manual(values=palettes)

#ggsave("../Figs/AUC_folds_comparison_short_names_updated.pdf", p, width = 6, height = 6)

p

Corresponding p-value based on statistical test for avg ROC curve (Fig. 5c)

Show code
AUC_folds_summary_updated_plot2 <- AUC_folds_summary_updated_plot %>%
  mutate(label_plot = fct_recode(label, `Single most\n relevant layer`="SingleMostSignif"),
         label_plot = fct_relevel(label_plot, "All layers", after = 1))

palettes2 <- c('#8da0cb','#66c2a5',"#fec44f" , '#fc8d62')

p_compare_main <- ggwithinstats(
  data = AUC_folds_summary_updated_plot2,# %>% filter(!excluded_layer %in% c("CoEx", "CoEssential")) %>% ungroup,
  x = label_plot,
  y = med,
  p.adjust.method = "fdr",
  #title = "AUC values",
#  caption = "Data source: `WRS2` R package",
  ggtheme = theme_cowplot(), 
  xlab = "Network layers",
sample.size.label = F,
pairwise.display = "significant",
  ylab = "AUC", 
#ggplot.component = list(scale_color_brewer(palette = "Blues"), scale_fill_brewer(palette = "Blues")),
type = "non-parametric"#, 
#ggplot.component = c(col = "grey")
#  ggtheme = ggthemes::theme_fivethirtyeight()
) + scale_color_manual(values = palettes2)

#ggsave(plot = p_compare_main, filename = "../Figs/10fold_cv_method_compared_boxplot_updated.pdf", width = 6.5, height = 4, scale = 0.9)

p_compare_main

relationship between the disease, number of signifiant networks, and number of genes (Fig. 5d)

Show code
## load the LCC results 
source("../source/cache_all_networks_and_disease_genes.R")
## count number of networks, only coex

N_signif_net = processed_result_df %>% 
  filter(type=="Co-expression", LCC.signif!="none") %>%
  count(name, name = "significant_tissues")


## load the gene sets
Orphanet_df <-rare_genetic_diseases_genes$disgene_df %>% select(name, N)

Orphanet_df <- left_join(Orphanet_df, N_signif_net)

# retrieve median values
AUC_folds_perf_compare <- AUC_folds_summary_updated_plot %>% 
  pivot_wider(., id_cols = name, names_from = label, values_from = med) %>%
  mutate(dif = `Significant layers` - `All layers`) %>%
 # rename(name_short = "name") %>%
  left_join(x=., y = Orphanet_df) #, by = c(name = "name_short"))

AUC_allnets_median <- AUC_folds_summary_updated_plot %>% 
#    rename(name_short = "name") %>%
  dplyr::select(-Q25, -Q75) %>%
  dplyr::filter(label == 'Significant layers') %>%
  left_join(x=., y = Orphanet_df)


p1 <- ggplot(AUC_folds_perf_compare %>%
               ungroup %>%
               mutate(label = ifelse(dif< 0, name,""))
             , aes(x = significant_tissues, y = dif)) + 
  geom_point(col = "grey40") + 
  theme_cowplot() + guides(size = F) + 
  geom_hline(yintercept = 0, linetype = "dotted", col = "grey20") + 
 # stat_summary(fun.data= mean_cl_normal) + 
  geom_smooth(method='lm', se = F) +
  xlab("# Significant tissues") + ylab(expression("Performance diff. ("*Delta*"AUC)"))

p2 <- ggplot(AUC_allnets_median, aes(x = N, y = med)) + geom_point(col = "grey40") + theme_cowplot() + 
  guides(size = F) + xlab("# Associated genes") + ylab("Median AUC") + 
  scale_x_log10() +
   geom_smooth(method='lm', se = F)   



p <- p1 + p2

#ggsave(filename = "../Figs/scatter_AUC_vs_ntissues_updated.pdf",plot = p,  height = 3, width = 6)

p

Corresponding statistical parameters on the left plot (Performance difference vs #tissue):

Show code
cor.test(AUC_folds_perf_compare$dif,AUC_folds_perf_compare$significant_tissues, method = "spearman")

    Spearman's rank correlation rho

data:  AUC_folds_perf_compare$dif and AUC_folds_perf_compare$significant_tissues
S = 2937.2, p-value = 0.03071
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.4511678 

Corresponding statistical parameters on the right plot (Performance vs #genes):

Show code
cor.test(AUC_allnets_median$N, AUC_allnets_median$med, method = "spearman")

    Spearman's rank correlation rho

data:  AUC_allnets_median$N and AUC_allnets_median$med
S = 3642, p-value = 6.679e-06
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.7994071 

Performance difference: using relevant networks vs all networks (Suppl. Fig. 7b)

Show code
p <- AUC_folds_perf_compare %>%
  left_join(., distinct(AUC_folds_summary_updated_plot, name)# %>% rename(name_short = "name")
            , by="name") %>%
  ungroup %>% 
  mutate(name = fct_reorder(name, significant_tissues)) %>%
  ggplot(., aes(x = name, y = significant_tissues, fill=dif)) + geom_col() + theme_minimal() + coord_flip() + 
  scale_fill_gradient2(low = "#66c2a5", high = "#8da0cb") + 
  ylab("# Significant tissues") + xlab("") + labs(fill = "")+ ggtitle("Performance difference", subtitle = "Selected networks vs all networks")

#ggsave("../Figs/barchart_n_tissues_vs_performance_difference.pdf", height = 4, width = 6)

p

Performance upon curated layers removal

Performance for 10-fold CV for 26 disease group gene retrieval upon layer removal (Suppl. Fig. 7d) - PPI - Co-Pathway - GOMF - GOBP - HP - MP - All of them

Show code
AUC_folds_exclude <- readRDS("../cache/fold_cv_processed_results_revision_excludesSingleLayers.RDS") %>%
  rbind( readRDS("../cache/fold_cv_processed_results_revision.RDS") %>% filter(label == "Largescale_signif"))


#AUC_folds_exclude <- AUC_folds_exclude %>% 
#  group_by(name, label) %>% 
#  summarise(mean = mean(AUC), med = median(AUC), Q25 = quantile(AUC, 0.25), Q75 = quantile(AUC, 0.75)) #%>%

#AUC_folds_exclude_plot <- rbind(AUC_folds_exclude,
#                                AUC_folds_revision %>% dplyr::filter(label %in% c("signif_withCoexCore", "Largescale_signif_weighted", "Largescale_signif", "coex_Signif", "coex_core")))

AUC_folds_Signif_all <-  readRDS("../cache/fold_cv_processed_results_revision.RDS") %>%
  filter(label == "signif_withCoexCore") 
existing_disease <- unique(AUC_folds_Signif_all$name)

new_df <- AUC_folds_summary <- readRDS("../cache/fold_cv_processed_results.RDS") %>% filter(!name %in% existing_disease, label == "significant networks")  %>% rename(AUC = "mean") %>% select(name, AUC)

new_df_dup <- new_df
for(i in 1:9){
  new_df_dup <- rbind(new_df_dup, new_df)
}

new_df_dup <- new_df_dup%>%
  mutate(name = as.character(name)) %>%
  select(AUC, name) %>% as.data.frame()

AUC_folds_Signif_all <- AUC_folds_Signif_all %>%
  select(-label) %>%
  rbind(new_df_dup) %>%
  mutate(label = "Signif_All")

AUC_folds_exclude_plot <- rbind(AUC_folds_exclude,
                                AUC_folds_Signif_all)


order_network <- AUC_folds_exclude_plot %>% group_by(label) %>% summarise(median = mean(AUC)) %>% arrange(median) %>% pull(label) %>% as.character() %>% rev()


#AUC_folds_Signifs <- AUC_folds_Signifs %>%
#  filter(excluded_layer %in% order_network) %>%
#  mutate(excluded_layer = fct_relevel(excluded_layer, order_network),
#         excluded_layer = recode_factor(excluded_layer, "All Significant layers" = "Signif_all"),
#         excluded_layer = fct_relevel(excluded_layer, "All Significant layers"),
#        excluded_layer = fct_relevel(excluded_layer, "LargeScaleNet", after = Inf))


#order_network <- AUC_folds_exclude_plot %>% group_by(excluded_layer) %>% summarise(median = median(med)) %>% arrange(median) %>% pull(excluded_layer)

order_network_label <- c("All significant", "Co-Pathway\n(Reactome)", "GO\n(MF)","Phenotype\n(MP)", "GO\n(BP)", "PPI", "Phenotype\n(HP)", "All curated")

AUC_folds_exclude_plot$layer <- factor(AUC_folds_exclude_plot$label, levels = (order_network), labels = (order_network_label), ordered = T)

AUC_folds_exclude_plot  <- AUC_folds_exclude_plot %>% ungroup

#AUC_folds_exclude_plot <- AUC_folds_exclude_plot[order(AUC_folds_exclude_plot$layer),]

#AUC_folds_exclude_plot$layer <- as.character(AUC_folds_exclude_plot$label)

list_signif <- lapply(order_network[order_network_label!="All significant"], function(x) c("All significant", x))

pacman::p_load(ggsignif)

p_compare_all_nets <- ggwithinstats(
  data = AUC_folds_exclude_plot ,
  x = layer,
  y = AUC,
    
  p.adjust.method = "BH",
 title = "AUC values upon removing layers",
#  caption = "Data source: `WRS2` R package",
  ggtheme = theme_cowplot(), 
  xlab = "Excluded layer",
  ylab = "10-fold CV median AUC", 
  sample.size.label = F,
#ggplot.component = list(scale_color_brewer(palette = "Blues"), scale_fill_brewer(palette = "Blues")),
type = "non-parametric"#, 
#ggplot.component = c(col = "grey")
#  ggtheme = ggthemes::theme_fivethirtyeight()
, pairwise.comparisons = F,
 centrality.plotting = FALSE
)


#ggsave("../Figs/compare_CuratedNetwork_removal_AUC_new.pdf", p_compare_all_nets, height = 4, width =7, scale = 1.25)

p_compare_all_nets

The impact of removing PPI layer (Suppl. Fig. 7c)

Show code
# load the randomisation results
if(!file.exists("../cache/fold_cv_processed_results_revision_LargeScalePPIRandomisation.RDS")){
  source("../source/randomization_10fold_CV_largescale_PPI.R")
}

AUC_folds_PPI_additional <- readRDS("../cache/fold_cv_processed_results_revision.RDS") %>%
  filter(grepl("PPI", label)) %>%
#  rbind(., AUC_folds_PPI_largescale_random) %>%
                          group_by(name, label) %>%
  summarise(mean = mean(AUC), med = median(AUC), Q25 = quantile(AUC, 0.25), Q75 = quantile(AUC, 0.75))
  
AUC_folds_PPI <- readRDS("../cache/fold_cv_processed_results.RDS") %>%
  filter(grepl("PPI", label)) %>%
  select(name, mean, med, Q25, Q75, label) %>%
  rbind(., AUC_folds_PPI_additional) %>%
  mutate(label = factor(label, levels = c("PPI only", "PPI_curated", "PPI_largescale"), 
                        labels = c("Full", "Curated", "Large scale")),
         label_plot = label)
  



p_compare_PPI <- ggwithinstats(
  data = AUC_folds_PPI,# %>% filter(!excluded_layer %in% c("CoEx", "CoEssential")) %>% ungroup,
  x = label_plot,
  y = med,
  p.adjust.method = "BH",
  title = "AUC values upon removing layers",
#  caption = "Data source: `WRS2` R package",
  ggtheme = theme_cowplot(), 
  xlab = "PPI set",
  ylab = "10-fold CV median AUC", 
#ggplot.component = list(scale_color_brewer(palette = "Blues"), scale_fill_brewer(palette = "Blues")),
type = "non-parametric"#, 
#ggplot.component = c(col = "grey")
#  ggtheme = ggthemes::theme_fivethirtyeight()
)

### Results from large sclae randomization
AUC_folds_randomisation <- readRDS("../cache/fold_cv_processed_results_revision_LargeScalePPIRandomisation.RDS") %>% group_by(name) %>% summarise(AUC = mean(AUC)) %>%
  mutate(layer = "Random\nControl")

AUC_folds_randomisation <- rbind(AUC_folds_randomisation,
                                 # to extend the axis: quick hack to add another label going up to one.
                                 # this serves no purpose in the analyses
                                 AUC_folds_randomisation %>% select(-layer) %>% 
                                   mutate(AUC = seq(0.52, 1, 0.02),
                                   layer ="zAxis spanner"))

AUC_folds_randomisation[nrow(AUC_folds_randomisation), "AUC"] = 1  

p_compare_PPI_rand <- ggstatsplot::ggbetweenstats(
  data = AUC_folds_randomisation %>% ungroup,
  x = layer,
  y = AUC,
    
  p.adjust.method = "none",    
 title = "Randomization PPI large-scale",
#  caption = "Data source: `WRS2` R package",
  ggtheme = theme_cowplot(), 
  xlab = "Layer",
  ylab = "10-fold CV median AUC", 
  sample.size.label = F,
type = "nonparametric",
#ggplot.component = list(scale_color_brewer(palette = "Blues"), scale_fill_brewer(palette = "Blues")),, 
#ggplot.component = c(col = "grey")
#  ggtheme = ggthemes::theme_fivethirtyeight()
pairwise.comparisons = F#,
# centrality.plotting = FALSE
)


#ggsave("../Figs/compare_PPILargeScale_rand.pdf", p_compare_PPI_rand, height = 4, width =3, scale = 1)
#ggsave("../Figs/compare_PPI_network_removal_AUC.pdf", p_compare_PPI, height = 4, width =3, scale = 1.125)

p_compare_PPI + p_compare_PPI_rand

```{.r .distill-force-highlighting-css}

Application to gene prioritization in rare disease patients

Number of candidate genes per patient

Show code
gene_list <- read_csv("../data/patient_gene_list.csv") %>%
  mutate(Patient = factor(Patient, level = c("N_064", "N_126", "N_020", "N_026", "N_003", "N_129", "N_007", 
"N_040"), labels = paste0("P", 1:8)))


gene_list %>% count(Patient, name = "number of genes")
# A tibble: 8 x 2
  Patient `number of genes`
* <fct>               <int>
1 P1                     47
2 P2                     33
3 P3                      9
4 P4                      4
5 P5                      6
6 P6                     13
7 P7                     10
8 P8                      7

Average number of genes with high confident variants detected in each patient:

Show code
nrow(gene_list)/length(unique(gene_list$Patient))
[1] 16.125

These patient are diagnosed, i.e. has a confirmed causal variant. We wonder whether the causal gene can be picked up by the algorithm. The patient phenotype belong to rare genetic neurological disorder.

Patient phenotypes (Suppl. Fig. 4c)

Show code
patient_hpo_df <- read_csv("../data/patient_hpo_terms.csv", 
                           col_names = c('Patient', "hpo_id", "hpo_label"), skip = 1) %>%
  mutate(hpo_id = str_trim(hpo_id, side = 'both')) %>%
  mutate(patient_ID = factor(Patient, levels = unique(Patient), label = paste0("P", 1:length(unique(Patient)))))

hpo_counts <- patient_hpo_df %>%
  count(hpo_label) %>%
  arrange(n)  %>%
  mutate(hpo_label = factor(hpo_label, levels = hpo_label))

patient_hpo_df$hpo_label = factor(patient_hpo_df$hpo_label, levels = hpo_counts$hpo_label)

p = ggplot(patient_hpo_df %>% mutate(found=T), aes(x=patient_ID, y =hpo_label, fill = found)) + 
  #geom_tile(col = "white") + 
  geom_point(aes(colour = found), size = 6) +
  guides(fill = F, col=F) + 
  theme_minimal() + xlab("Patient") + ylab("HPO label") +
  coord_equal()

#ggsave("../Figs/heatmap_patient_phenotype.pdf", p)

p

Patient-specific rankings

The prioritization was performed using informed propagation, with seed genes specific to each patient.

Show code
precomputed_patient_ranked_genes <- "../cache/patient_ranking_results.RDS"

if(!file.exists(precomputed_patient_ranked_genes)){
  source("../source/patient_causal_gene_prediction.R")
}

# # load precomputed results
patient_annotated_df <- readRDS(precomputed_patient_ranked_genes) %>%
  mutate(Patient = factor(Patient, level = c("N_064", "N_126", "N_020", "N_026", "N_003", "N_129", "N_007", 
"N_040"), labels = paste0("P", 1:8)))

Below is the causal gene for each patient, and how well the algorithm performs for each case.

Show code
# show the causal genes and how well the algorithm performs
correct_rank <- patient_annotated_df %>% 
  filter(Diagnostic=="YES", network_set == "signif") %>%
  select(-network_set, -avg) %>% 
  arrange(network_rank, -total_variants)

correct_rank
# A tibble: 8 x 5
# Groups:   Patient [8]
  Patient GENE    Diagnostic network_rank total_variants
  <fct>   <chr>   <chr>             <dbl>          <int>
1 P1      PRDM12  YES                   1             46
2 P2      LAMA1   YES                   1             33
3 P3      HUWE1   YES                   1              9
4 P4      CNTNAP2 YES                   1              4
5 P5      CDKL5   YES                   2              6
6 P6      LAMA1   YES                   4             13
7 P7      BCL11A  YES                   4             10
8 P8      CHD2    YES                   4              7

Patient ranking plot (Fig. 5f, Suppl. Fig. 4d)

Show code
# ensure it keeps the order 

p_rank = list()
for( i in unique(patient_annotated_df$network_set)){
  correct_rank <- patient_annotated_df %>% 
  filter(Diagnostic=="YES", network_set == i) %>%
  select(-network_set, -avg) %>% 
  arrange(network_rank, -total_variants)
  
p_rank[[i]]  <- ggplot(correct_rank) +
  geom_segment( aes(x=Patient, xend=Patient, y=0, yend=total_variants), color="grey", size = 4) +
  geom_point(aes(x=Patient,y=network_rank), size=5, col="orange", shape = 15) +
  geom_text(aes(x=Patient,y=network_rank, label = network_rank), col = "white", fontface = "bold")+
  theme_light() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank()
  ) +
  xlab("Patient") +
  ylab("# Candidate genes")

#ggsave(p_rank[[i]], filename = paste0("../Figs/patient_gene_ranking_",i,".pdf"), width = 2.5, height = 3)
}

library(patchwork)
p_rank[["signif"]] + ggtitle("Signif") +
  p_rank[["all"]] + ggtitle("All networks") +  
  p_rank[["ppi"]]  + ggtitle("PPI only")

Patient AUC plot (Fig. 5g)

Show code
AUC_plot <- patient_annotated_df %>%
  mutate(pos = (network_rank-1)/(total_variants-1), 
         label = ifelse(Diagnostic=="NO", 0,1))

pacman::p_load(pROC)
rocvals <- list()
rocvals_df <- list()
for(i in unique(AUC_plot$network_set)){
  df <- AUC_plot %>% filter(network_set==i)
  response <- df$label
  predictor <- df$pos
  rocvals[[i]] <- pROC::roc(response, predictor)
  rocvals_df[[i]] <- tibble(Sensitivity =  rocvals[[i]]$sensitivities, Specificity = rocvals[[i]]$specificities)
}


rocvals_merged_df <- bind_rows(rocvals_df, .id = "rank_type")

palettes <- c('#66c2a5','#fc8d62','#8da0cb')

p <- ggplot(rocvals_merged_df, aes(x = Specificity, y = Sensitivity, col = rank_type)) + 
  geom_line(size=2) + 
  scale_x_reverse() + 
  scale_color_manual(values = palettes) + guides(col = FALSE) +
  cowplot::theme_cowplot()

#ggsave(p, filename = "../Figs/patient_gene_ranking_AUC.pdf", width = 3, height = 3)

p

The AUC for each method is as shown below.

Show code
tibble(methods = names(rocvals), auc  = sapply(rocvals, function(x) x$auc))
# A tibble: 3 x 2
  methods   auc
  <chr>   <dbl>
1 all     0.872
2 signif  0.840
3 ppi     0.718

The corresponding p-values are:

Show code
pairs = combn(names(rocvals),2) %>% t %>% as.tibble()
auc_compare_test <- apply(pairs,1, function(x) roc.test(rocvals[[x[1]]], rocvals[[x[2]]], ))

pairs$pval <- sapply(auc_compare_test, function(x) x$p.value)

pairs
# A tibble: 3 x 3
  V1     V2      pval
  <chr>  <chr>  <dbl>
1 all    signif 0.691
2 all    ppi    0.192
3 signif ppi    0.329

RD-connect data analyses

Note: Access to the data is restricted to validated users, Please see the details of our access, queries and filtering in the Methods section of the manuscript.

Show code
library(pacman)
p_load(cowplot)
###################


# RD-connect patient queries (cia RDconnect platform)
patient_data <- read_tsv("../data/RDconnect_queries.tsv")

#patient_data %>% group_by(`Solved status`,`In platform`) %>% count 
Show code
file_variants = list.files("../data/rd-connect/")

# check if all patients are in the files
ID_files <- sapply(file_variants, function(i) str_remove(i, ".tsv"))

check.missID = FALSE
if(check.missID){
    #missing variants (as upload process)
  missed_IDs <- setdiff(solved_patient_annotated$`Experiment ID`, ID_files)
  
  if(length(missed_IDs)==0){
    print("All patients have corresponding files")
  } else{
    sprintf("missing Experimental IDs: %s",paste(missed_IDs, collapse = ", "))
  }
}

patient_variants <- list()
for(i in file_variants){
  patient_variants[[str_remove(i, ".tsv")]] <- read_tsv(paste0("../data/rd-connect/",i), skip_empty_rows = T, col_types = 'cccccc-ccccccccccc')
}

patient_variants_df <- bind_rows(patient_variants, .id = "Experiment ID")



solved_patients <- patient_data %>% 
  separate_rows(Genes,sep =  ";") %>%
  filter(`In platform`, grepl("solved", Genes)) %>%
  mutate(causal_gene = str_remove(Genes,pattern =  " \\(solved\\)"))

# merge results from all methods

solved_patient_annotated <- solved_patients %>% 
  select(`Participant ID`, causal_gene, `Experiment ID`) %>%
  rename(Patient= `Participant ID`, "GeneName"=causal_gene) %>%
  mutate(causal = TRUE)


patient_variants_df <- left_join(patient_variants_df,solved_patient_annotated[,c('Patient', 'Experiment ID')]) %>% 
  filter(Chr %in% c(1:22, "X","Y","MT"))

# check if there are likely missing variants
#chr_count <- patient_variants_df %>%
#  count(`Experiment ID`, Chr)

#ggplot(chr_count) + geom_tile(aes(y = `Experiment ID`, x = Chr, fill = n))

Adding internal cohort + add causal gene to the gene list if missing

Show code
# simply merge 
RDConnect_patient_gene_df <- patient_variants_df  %>%
 # mutate(inPatient = TRUE) %>%
  select(`Experiment ID`, Patient, Gene)  %>%
  dplyr::rename(GeneName = Gene) %>%
  distinct() %>%
  # add the causal gene list to ensure it is in the patient list
  rbind(solved_patient_annotated %>% select(-causal))  %>%
  # count, if = 2 means it is found in both variant list and causal gene list -> meaning that the gene is found in the variant list
  count(`Experiment ID`, Patient, GeneName, name = "causalInVariantList") %>%
  mutate(causalInVariantList = factor(causalInVariantList, levels = c(1:2), labels = c(F,T)),
         cohort = "RDConnect") %>%
  left_join(., solved_patient_annotated)


# Internal patient
internal_patient_gene_list <- read_csv("../data/patient_gene_list.csv") %>%
  mutate(`Experiment ID` = paste0("INT_", Patient_ID),
         causal = ifelse(Diagnostic == "YES", TRUE, NA)) %>%
  dplyr::rename(GeneName = GENE) %>%
  mutate(causalInVariantList = T,
         cohort = "Internal") 

# merge RDconnect and local data
patient_gene_df <- rbind(RDConnect_patient_gene_df, internal_patient_gene_list %>%
                           select(colnames(RDConnect_patient_gene_df)))


# check if local gene is labelled in proper ID
all_causal_genes <- patient_gene_df %>% filter(!is.na(causal)) %>% pull(GeneName) %>% unique()

source("../functions/fn_source.R")
all_causal_genes_ID <- IDconvert(all_causal_genes, from ="SYMBOL", to = "ENTREZID")

# check which genes as no equiv id
all_causal_genes_without_ID = all_causal_genes[is.na(all_causal_genes_ID)]

# manually modify the genes without valid causal gene
genes_without_ID_replace <- c("SIL1", "GMPPB", "TRNL1", "TRAPPC9")
names(genes_without_ID_replace) = c("*608005", "*615320", "MT-TL1", "TRAPP9")

patient_gene_df <- patient_gene_df %>%
  mutate(GeneName = ifelse(GeneName %in% names(genes_without_ID_replace), genes_without_ID_replace[GeneName], GeneName))

Filtering patients without causal genes described in their variant list

Show code
patient_all_genes_df <- patient_variants_df  %>%
 # mutate(inPatient = TRUE) %>%
  left_join(., solved_patient_annotated, by = c("Experiment ID", "Patient", Gene = 'GeneName')) %>%
  distinct(`Experiment ID`, Patient, Gene, causal) 


patients_withGenesFound <- patient_all_genes_df %>% count(`Experiment ID`, causal)  %>% 
  filter(!is.na(causal)) %>% mutate(genefound = TRUE)


patient_genes_df <- left_join(solved_patient_annotated, patients_withGenesFound[,c(1,4)])  

solved_patient_genes <- patient_genes_df %>% filter(genefound) %>% pull(GeneName, name = `Experiment ID`)

Number of genes per patients

Show code
library(report)

genes_per_solved_patients <- patient_all_genes_df %>%
  filter(`Experiment ID` %in% names(solved_patient_genes)) %>%
  count(`Experiment ID`)

p <- ggplot(genes_per_solved_patients, aes(x=n, y = "")) + 
  geom_violin(fill = "#fccf66") + 
  geom_jitter(col = "#ebaf26" ) +
 # ggforce::geom_sina() +
    cowplot::theme_minimal_vgrid() +
    ggtitle("Distribution of genes", subtitle = report(genes_per_solved_patients$n)) +
    xlab ("# genes per patient, filtered by rare variants") + 
    ylab("")

#ggsave("../Figs/RDconnect_genes_per_patient_violin.pdf", width = 6, height = 3)

p

HPO per patient - RDconnect

Show code
RDConnect_solved_IDs <- patient_gene_df %>% filter(cohort == "RDConnect") %>% pull(`Experiment ID`) %>% unique()

patient_hpo_df <- patient_data %>%
  filter(`Experiment ID` %in% RDConnect_solved_IDs) %>%
  separate_rows(`HPO ids`, sep = ";") %>%
  # ensures the HPO term is properly labelled (start with HP)
  filter(str_starts(string = `HPO ids`, pattern = "HP"), str_count(`HPO ids`)==10) %>%
  distinct(`Experiment ID`,`HPO ids`) 
 
HPO_count <- patient_hpo_df %>%
  count(`HPO ids`) %>%
  arrange(-n)

HPO_details <- read_tsv("../data/raw_data/HPO_phenotype_to_genes.txt", 
                        col_types = c('cc-----'), col_names = c("HPOID", "HPOname"),
                        skip = 1) %>%
  distinct()

HPO_count <- left_join(HPO_count, HPO_details, by = c(`HPO ids` = "HPOID"))

HPO_count$HPOname <- factor(HPO_count$HPOname, levels = rev(HPO_count$HPOname))

# most abundant phenotypes
top_HP_terms <- HPO_count[1:20,]

# relabel the particular term that are too long
top_HP_terms$HPOname <- fct_recode(top_HP_terms$HPOname, !!!c("Delayed speech"="Delayed speech and language development")) 
#labels(top_HP_terms$HPOname)[2] <- "Delayed Speech"
top_HP_terms$queried_term <- grepl("Intellectual disability", top_HP_terms$HPOname)

p_HPOcount <- ggplot(top_HP_terms, aes(x = n, y = HPOname)) + 
  geom_col(aes(fill = queried_term)) + 
  theme_cowplot() +
  ggtitle("Most frequent phenotypes") +
  xlab ("# Patients") + 
  ylab("Phenotype term") +
  scale_fill_manual(values = rev(c("#fccf66", "#8DB5CD"))) +
  guides(fill = F)

#ggsave("../Figs/RDconnect_top20_phenotypes.pdf", plot = p_HPOcount, width = 6, height = 4)

p_HPOcount

Show code
HPO_count_per_patient <- patient_hpo_df %>%
  count(`Experiment ID`) %>%
  arrange(-n)


Patients_with_diagnostics <-  tibble(`Experiment ID` = patient_data$`Experiment ID`,
                                     Diagnosed = patient_data$`ORDO disorder` != "unknown")
                                     #Diagnosed = ifelse(patient_data$`OMIM disorder` != "unknown" | patient_data$`ORDO disorder` != "unknown" , T, F) )

HPO_count_per_patient_tile <- HPO_count_per_patient %>%
  left_join(., Patients_with_diagnostics) %>%
  arrange(Diagnosed) %>%
  group_by(n) %>%
  mutate(pos = 1:n())

p_HPO_count_per_patient_tile <- ggplot(HPO_count_per_patient_tile, aes(x = n, y = pos)) + 
  geom_tile(aes(fill = Diagnosed), col = "#DADADA", size = 0.8) + 
  theme_cowplot() +
  ggtitle("# Phenotypic terms per patient") +
  xlab ("# Phenotype terms") + 
  ylab("# Patients") +
  scale_y_continuous(breaks = c(1,3,5,7,9)) +
  scale_fill_manual(values = rev(c("#fccf66", "#8DB5CD"))) +
  guides(fill = F)

#ggsave("../Figs/RDconnect_HPO_count_per_patient_tiled.pdf", plot = p_HPO_count_per_patient_tile, width = 6*1.25, height = 2*1.25)

p_HPO_count_per_patient_tile

HPO association to seed genes

Show code
internal_patient_hpo_df <- read_csv("../data/patient_hpo_terms.csv", 
                           col_names = c('Patient', "hpo_id", "hpo_label"), skip = 1) %>%
  mutate(hpo_id = str_trim(hpo_id, side = 'both')) %>%
  left_join(., internal_patient_gene_list[,c("Patient","Patient_ID", "Experiment ID")]) %>%
  distinct()


patient_hpo_df_all_cohort <- rbind(patient_hpo_df, 
                                   internal_patient_hpo_df %>% 
                                     mutate(`HPO ids` = hpo_id) %>%
                                     select(`Experiment ID`, `HPO ids`) 
                                     
                                     )


file_hpo <- "../cache/rdconnect_hpo_to_genes.RDS"

if(!file.exists(file_hpo)){
  pacman::p_load(pbapply)
  
  # create a function for retrieving genes associated to a hpo term
  hpo_genes_query = function(term){
    url_hpo <- httr::GET(paste0("https://hpo.jax.org/api/hpo/term/",term,"/genes?max=-1"))
    gene_list <- httr::content(url_hpo)
    
    # return gene label 
    gene_names <- sapply(gene_list$genes, function(x) x$entrezGeneSymbol)
    return(list(gene_names))
  }
  
  hpo_gene_association <- pbapply::pbsapply(unique(patient_hpo_df_all_cohort$`HPO ids`), hpo_genes_query)

  
  saveRDS(hpo_gene_association, file_hpo)
} else{
  print("read precomputed HPO association")
  hpo_gene_association <- readRDS(file_hpo)
}
[1] "read precomputed HPO association"
Show code
# take only genes exist in network as seeds
seed_genes <- patient_hpo_df_all_cohort %>% 
  group_by(`Experiment ID`) %>%
  summarise(hpo_terms = list(`HPO ids`)) %>%
  rowwise() %>%
  mutate(associated_genes = list(hpo_gene_association[hpo_terms])) #%>%
 # left_join(., solved_patients[,c("Participant ID", "Experiment ID")])

Prediction based on genes associated with the phenotypes alone

Show code
# Prediction based on HPO

genes_associated_with_patient_phenotype <- seed_genes %>% rowwise() %>%
  mutate(all_hpo_associated_genes = list(unique(unlist(associated_genes)))) %>%
  pull(all_hpo_associated_genes, name = `Experiment ID`)

# how mayne gene phenotype patients are actually in the patients


patient_all_genes_with_features_df <- patient_gene_df %>%
  rowwise() %>%
  mutate(inHPO = GeneName %in% genes_associated_with_patient_phenotype[[`Experiment ID`]])


sum_patient_genes_in_HPO_df <- patient_all_genes_with_features_df %>%
  filter(cohort == "RDConnect") %>%
  count(`Experiment ID`, inHPO, name = "n_inHPO") %>% 
  filter(inHPO) %>%
  select(-inHPO) %>%
  inner_join(., genes_per_solved_patients) %>%
  pivot_longer(., cols = c("n","n_inHPO"), names_to = "group", values_to = "count") %>%
  mutate(group = factor(group, levels = rev(c("n","n_inHPO")), labels = rev(c("All genes", "Genes related to \n patient phenotypes"))))


p <- ggplot(sum_patient_genes_in_HPO_df, aes(x=count, y = group)) + 
  geom_violin(fill = "#fccf66") + 
  geom_jitter(col = "#ebaf26" ) +
 # ggforce::geom_sina() +
    cowplot::theme_minimal_vgrid() +
    ggtitle("Distribution of genes")+#, subtitle = report(genes_per_solved_patients$n)) +
    xlab ("# genes per patient, filtered by rare variants") + 
    ylab("") + 
  scale_x_log10()

#ggsave("../Figs/RDconnect_genes_per_patient_violin_with_HPO_filter.pdf", width = 6, height = 3)

p

Gene prioritization based on different feature-level

Show code
# features
gene_prop_df <- read_tsv("../cache/gene_features.tsv")

patient_all_genes_with_features_df <- patient_all_genes_with_features_df %>%
  left_join(., gene_prop_df, by = c(GeneName = "gene"))
  


# add ranks
patient_all_genes_df_with_rank <- patient_all_genes_with_features_df %>%
  group_by(`Experiment ID`) %>%
  mutate(rank_inHP = rank(!inHPO, ties.method = "random"),
         rank_HP = rank(-HP_terms),
         rank_popularity = rank(PubMedRank),
         rank_expression_BrainAll = rank(-AllBrainAvg),
         rank_expression_FrontalCortex = rank(-FrontalCortex),
         rank_pathwaycounts = rank(-n_pathways)
         )


# plot ROC AUC

feature_based_AUC_plot <- patient_all_genes_df_with_rank %>%
  select(starts_with("rank_"), `Experiment ID`, GeneName, causal) %>%
  mutate(total_gene = n()) %>%
  pivot_longer(cols = starts_with("rank_"), names_to = "group", values_to = "rank") %>%
  mutate(pos = (rank-1)/(total_gene-1), 
         label = ifelse(is.na(causal), 0,1))

pacman::p_load(pROC)
rocvals <- list()
rocvals_df <- list()
auc_vals <- list()

for(i in unique(feature_based_AUC_plot$group)){
  df <- feature_based_AUC_plot %>% filter(group==i)
  response <- df$label
  predictor <- df$pos
  rocvals[[i]] <- pROC::roc(response, predictor)
  auc_vals[[i]] <- pROC::auc(response, predictor)
  rocvals_df[[i]] <- tibble(Sensitivity =  rocvals[[i]]$sensitivities, Specificity = rocvals[[i]]$specificities)
}


rocvals_merged_df <- bind_rows(rocvals_df, .id = "rank_type")

palettes <- c('#66c2a5','#fc8d62','#8da0cb')

p_roc_gene_prop <- ggplot(rocvals_merged_df, aes(x = Specificity, y = Sensitivity, col = rank_type)) + 
  geom_line(size=2) + 
  scale_x_reverse() + 
  #scale_color_manual(values = palettes) + 
  scale_color_brewer() + 
#  guides(col = FALSE) +
  cowplot::theme_cowplot()


p_roc_gene_prop

Comparison with the informed multiplex propagation

Show code
patient_network_ranking_results <- "../cache/test_rdconnect_ranking_results.RDS"

# if the result file are not there yet, load the data
if(!file.exists(patient_network_ranking_results)){
  source("../source/prioritize_patient_rdconnect.R")
}

patient_annotated_df <- readRDS(patient_network_ranking_results)
# this is to compare causal gene with the entire genome
patient_annotated_df_all_methods <- bind_rows(patient_annotated_df, .id = "network_set") %>%
  mutate(`Experiment ID` = Patient) %>%
  select(network_set, `Experiment ID`,  GeneName, seed, avg)


gene_per_patients <- patient_gene_df %>%
    count(`Experiment ID`, name = "total_genes") 

patient_all_genes_rank <- patient_annotated_df_all_methods %>%
  #dplyr::rename(Patient = "Experiment ID") %>%
  # add the causal column
   inner_join(., patient_gene_df) %>%
  # add gene count per patient + filter for patients with causal genes in the variant list
  inner_join(., gene_per_patients) %>%
  group_by(`Experiment ID`, network_set) %>%
  arrange(-avg) %>%
  mutate(network_rank = rank(-avg)) #%>%

patient_all_genes_rank_filtered <- patient_all_genes_rank %>%
  filter(causal) %>%
  filter(!is.na(network_set)) %>%
  mutate(rank_group = base::cut(network_rank, c(0:5, 10,20,500), labels = c(1,2,3,4,5,"top10","top20","over20")))

patient_causal_gene_rank_df <- patient_all_genes_rank_filtered %>%
  select(-c(avg, rank_group)) %>%
  pivot_wider(names_from = "network_set", values_from = "network_rank") 



###### AUC plot for patients

AUC_plot <- patient_all_genes_rank %>%
  mutate(pos = (network_rank-1)/(total_genes-1),
         label = ifelse(is.na(causal), 0,1))

pacman::p_load(pROC)
rocvals_df <- list()
for(i in unique(AUC_plot$network_set)){
  df <- AUC_plot %>% filter(network_set==i)
  response <- df$label
  predictor <- df$pos
  rocvals[[i]] <- pROC::roc(response, predictor)
  auc_vals[[i]] <- pROC::auc(response, predictor)
  rocvals_df[[i]] <- tibble(Sensitivity =  rocvals[[i]]$sensitivities, Specificity = rocvals[[i]]$specificities)
}


rocvals_merged_df_network <- bind_rows(rocvals_df, .id = "rank_type")
Show code
 # look at the two of them closely

palettes <-  c(RColorBrewer::brewer.pal(6, "Greens"), RColorBrewer::brewer.pal(3, "Blues"))

results_combine <- rbind(rocvals_merged_df, rocvals_merged_df_network)
results_combine$rank_type <- factor(results_combine$rank_type, levels = c(unique(rocvals_merged_df$rank_type), unique(rocvals_merged_df_network$rank_type)))

p <- ggplot(results_combine, aes(x = Specificity, y = Sensitivity, col = rank_type)) + 
  geom_line(size=2) + 
  scale_x_reverse() + 
  scale_color_manual(values = palettes) + #guides(col = FALSE) +
  cowplot::theme_cowplot()

p

Reduce the information to network examples

Show code
#palettes <-  c(RColorBrewer::brewer.pal(5, "Greens")[2:5], "#3182BD")
#palettes <-  c(RColorBrewer::brewer.pal(5, "Greens")[2:5],  "grey80")
palettes <-  c(paste0("LightSkyBlue", 1:4),  "#FCCE67")

selected_features <- c("rank_inHP", "rank_popularity", "rank_expression_BrainAll", "rank_pathwaycounts", "informedMultiPlex")

selected_features_label <- c("Feature: Phenotype overlap", "Feature: PubMed counts", "Feature: Brain expression level", "Feature: Pathway counts", "Reference: Informed Multiplex\nPropagation")

results_combine <- rbind(rocvals_merged_df, rocvals_merged_df_network) %>%
  filter(rank_type %in% selected_features)

results_combine$rank_type <- factor(results_combine$rank_type, levels = selected_features, label = selected_features_label)

p <- ggplot(results_combine, aes(x = Specificity, y = Sensitivity, col = rank_type)) + 
  geom_line(size=2) + 
  scale_x_reverse() + 
  geom_abline(slope = 1, intercept = 1, col = "grey50", linetype = "dashed") +
  scale_color_manual(values = palettes) + #guides(col = FALSE) +
  cowplot::theme_cowplot() +
  theme(legend.position = c(0.4,0.25)) +
  labs(color = "Ranking Criterion")

#ggsave("../Figs/Ranking_AUC_Network_vs_Features_comparison.pdf", p, width = 4, height = 4, scale = 1.25)

p

Comparing AUC levels

although this might look like it is incremental improvements over assigning HP alone, this is valuable for when narrowing down the variant list - or the case in which the causal genes are not yet associated to the phenotypes (elaborate on a few case here as well - yes it sucks but it’s needed.) - also, with the omnigenic model - having such network-based approach for assessing pathogenic effects are anyway required.

Show code
# performing p-value + AUC

auc_vals_df <- tibble(Method  = names(auc_vals), AUC = unlist(auc_vals)) %>%
  filter(Method %in% selected_features) %>%
  mutate(Method = factor(Method, levels = selected_features, labels = str_remove_all(selected_features_label, "Reference: |Feature: ")),
        col = palettes,
         Method = fct_reorder(Method, -AUC)) %>%
  arrange(AUC)


p_auc = ggplot(auc_vals_df, aes(x = AUC, y = Method, fill = Method)) + geom_col() +  scale_fill_manual(values = palettes[c(5, 1:4)]) + guides(fill = FALSE) + cowplot::theme_cowplot()

auc_vals_df
# A tibble: 5 x 3
  Method                              AUC col          
  <fct>                             <dbl> <chr>        
1 "Pathway counts"                  0.636 LightSkyBlue4
2 "Brain expression level"          0.709 LightSkyBlue3
3 "PubMed counts"                   0.742 LightSkyBlue2
4 "Phenotype overlap"               0.887 LightSkyBlue1
5 "Informed Multiplex\nPropagation" 0.951 #FCCE67      

Com[uting p-values based on Delong’s test

Show code
pval_df <- combn(selected_features, 2) %>% 
  t() %>% as_tibble() %>%
   rowwise() %>%  
mutate(pval = pROC::roc.test(rocvals[[as.character(V1)]], rocvals[[as.character(V2)]])$p.value)#,
     # pval = ifelse(pval==1, NA, pval))
#p value between the two best peforming methofds

#ggplot(pval_df, aes(x=Var1, y=Var2, fill = -log10(pval))) + geom_tile(na.rm = T) + theme_nothing() + scale_fill_distiller(direction = -1, na.value = "Transparent")

pval_df
# A tibble: 10 x 3
# Rowwise: 
   V1                       V2                           pval
   <chr>                    <chr>                       <dbl>
 1 rank_inHP                rank_popularity          2.29e-12
 2 rank_inHP                rank_expression_BrainAll 1.47e-15
 3 rank_inHP                rank_pathwaycounts       3.49e-21
 4 rank_inHP                informedMultiPlex        2.89e- 5
 5 rank_popularity          rank_expression_BrainAll 1.01e- 1
 6 rank_popularity          rank_pathwaycounts       3.94e- 7
 7 rank_popularity          informedMultiPlex        8.41e-29
 8 rank_expression_BrainAll rank_pathwaycounts       9.65e- 3
 9 rank_expression_BrainAll informedMultiPlex        3.34e-29
10 rank_pathwaycounts       informedMultiPlex        5.98e-36
Show code
network_based_rank <- patient_all_genes_rank %>%
  ungroup() %>%
  filter(!is.na(causal), !grepl("INT",`Experiment ID`)) %>%
  mutate(group = cut(network_rank, breaks = c(0,5,10,20,50,100,200,1000))) %>%
  count(network_set, group)
  

feature_based_rank <- feature_based_AUC_plot %>%
  ungroup() %>%
    dplyr::rename(network_set = "group") %>%
   filter(!is.na(causal), !grepl("INT",`Experiment ID`)) %>%
  mutate(group = cut(rank, breaks = c(0,5,10,20,50,100,200,1000))) %>%
  count(network_set, group)


rank_group <- rbind(network_based_rank, feature_based_rank) %>%
  filter(n > 0)
Show code
#palettes <-  c(RColorBrewer::brewer.pal(4, "Greens"), "#3182BD")

p <- ggplot(rank_group %>% filter(network_set %in% selected_features, group %in% c("(0,5]", "(5,10]", "(10,20]")) %>%
                                    mutate(network_set = factor(network_set, levels = selected_features, labels = str_remove(selected_features_label, "Reference: ")),
                                           group = factor(group, labels = paste0("Top ", c(5,10,20)))), aes(x = network_set, y = n, fill = network_set)) + 
 geom_col(position = "dodge") +
  facet_grid(.~ group) +
  scale_fill_manual(values = palettes) + #guides(col = FALSE) +
  cowplot::theme_cowplot() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  xlab("Ranking criteria") +
  ylab("# Patients") + labs(fill = "Ranking criterion")

#ggsave("../Figs/ranking_patient_top_compare_methods.pdf", p, width = 8, height = 2)

p

The difference between each method is presumably due to the absence of causal genes in the feature

Show code
rank_group %>% group_by(network_set) %>% 
  summarise(group = group, top5 = n, all_patients = sum(n)) %>% 
  filter(group == "(0,5]") %>%
  mutate(percent = top5/131*100)
# A tibble: 9 x 5
# Groups:   network_set [9]
  network_set                   group  top5 all_patients percent
  <chr>                         <fct> <int>        <int>   <dbl>
1 all                           (0,5]    10          130    7.63
2 informedMultiPlex             (0,5]    64          129   48.9 
3 ppi                           (0,5]    17          128   13.0 
4 rank_expression_BrainAll      (0,5]     9          131    6.87
5 rank_expression_FrontalCortex (0,5]     8          131    6.11
6 rank_HP                       (0,5]    13          131    9.92
7 rank_inHP                     (0,5]     8          131    6.11
8 rank_pathwaycounts            (0,5]     7          131    5.34
9 rank_popularity               (0,5]     4          131    3.05

Temporal-holdout benchmark (Suppl. Fig. 8)

Show code
PanelApp_data <- read_tsv("../data/raw_data/PanelApp_ID_v3.0_2019-12-10.tsv", col_types = paste0('--cc',str_dup("-", 32)))

PanelApp_IDgenes <- PanelApp_data %>% filter(grepl("Expert Review Green", `Sources(; separated)`)) %>% pull(`Gene Symbol`) %>% unique

gene_rank_data <- rbind(feature_based_AUC_plot,
                        AUC_plot %>% 
                          mutate(rank = rank(-avg)) %>%
                          dplyr::rename(group = "network_set", total_gene = "total_genes") %>%
                          select(colnames(feature_based_AUC_plot))) %>%
  filter(!grepl("INT", `Experiment ID`), group %in% c(selected_features, "rank_HP")) %>% 
  mutate(inID = GeneName %in% hpo_gene_association[["HP:0001249"]], 
         inIDPanel = GeneName %in% PanelApp_IDgenes,
         rank_binned = cut(rank, breaks = c(0,5,10,20,50,1000))) 

# genes not in IDPanel
Patient_gene_not_in_Panel <- gene_rank_data %>% ungroup %>%filter(causal, !inIDPanel) %>% pull(`Experiment ID`) %>% unique

AUC_plot_nonPanel <- gene_rank_data %>% ungroup %>% 
  filter(`Experiment ID` %in% Patient_gene_not_in_Panel) %>%
  rename(group = "network_set")

pacman::p_load(pROC)

rocvals_nonPanel = list()
auc_vals_nonPanel = list()
rocvals_df_nonPanel = list()
for(i in unique(AUC_plot_nonPanel$network_set)){
  df <- AUC_plot_nonPanel %>% filter(network_set==i)
  response <- df$label
  predictor <- df$pos
  rocvals_nonPanel[[i]] <- pROC::roc(response, predictor)
  auc_vals_nonPanel[[i]] <- pROC::auc(response, predictor)
  rocvals_df_nonPanel[[i]] <- tibble(Sensitivity =  rocvals_nonPanel[[i]]$sensitivities, Specificity = rocvals_nonPanel[[i]]$specificities)
}


rocvals_merged_df_network_nonPanel <- bind_rows(rocvals_df_nonPanel, .id = "rank_type")


# number of patients where causal genes are not labelled as green
gene_rank_data %>% ungroup %>%filter(causal, !inIDPanel) %>% count(group, rank_binned) %>% group_by(group) %>% summarise(sum = sum(n))
# A tibble: 6 x 2
  group                      sum
* <chr>                    <int>
1 informedMultiPlex           20
2 rank_expression_BrainAll    21
3 rank_HP                     21
4 rank_inHP                   21
5 rank_pathwaycounts          21
6 rank_popularity             21
Show code
rocvals_merged_df_network_nonPanel_plot <- rocvals_merged_df_network_nonPanel %>%
  filter(rank_type %in% selected_features) %>%
  mutate(rank_type = factor(rank_type, levels = selected_features, label = selected_features_label))

p <- ggplot(rocvals_merged_df_network_nonPanel_plot, aes(x = Specificity, y = Sensitivity, col = rank_type)) + 
  geom_line(size=2) + 
  scale_x_reverse() + 
  geom_abline(slope = 1, intercept = 1, col = "grey50", linetype = "dashed") +
  scale_color_manual(values = palettes) + #guides(col = FALSE) +
  cowplot::theme_cowplot() +
  guides(col = F)+
  #theme(legend.position = c(0.5,0.25)) +
  labs(color = "Ranking Criterion")

#ggsave("../Figs/Ranking_AUC_Network_vs_Features_comparison_genesNotFoundinPanel.pdf", p, width = 4, height = 4, scale = 1.25)

p

Ranking based on group:

Show code
AUC_plot_nonPanel_count <- AUC_plot_nonPanel %>% filter(causal, !inIDPanel, network_set %in% selected_features) %>% count(network_set, rank_binned)

p <- ggplot(AUC_plot_nonPanel_count %>% filter(rank_binned %in% c("(0,5]", "(5,10]", "(10,20]")) 
          %>%
                                    mutate(network_set = factor(network_set, levels = selected_features, labels = str_remove(selected_features_label, "Reference: ")),
                                           rank_binned = factor(rank_binned, labels = paste0("Top ", c(5,10,20)))
                                          )
, aes(x = network_set, y = n, fill = network_set)) + 
 geom_col(position = "dodge") +
  facet_grid(.~ rank_binned) +
  scale_fill_manual(values = palettes) + #guides(col = FALSE) +
  cowplot::theme_cowplot() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  xlab("Ranking criteria") +
  ylab("# Patients") + labs(fill = "Ranking criterion")

#ggsave("../Figs/ranking_patient_top_compare_methods_genes_not_in_IDpanel.pdf", p, width = 8, height = 2)

p

scatter plot comparison:

Show code
AUC_plot_nonPanel_compare <- AUC_plot_nonPanel %>%
  filter(network_set %in% c("rank_HP", "informedMultiPlex"), causal) %>%
  select(`Experiment ID`, GeneName, network_set, rank) %>%
  pivot_wider(names_from = "network_set", values_from = "rank") 
  
selected_genes <- c("PAX3", "MAPK8IP3", "ZNF292", "FA2H", "PHIP", "CARPRIN1", "AGO2", "NFIB", "CSF1R", "SPAST")

p <- ggplot(AUC_plot_nonPanel_compare %>% filter(rank_HP < 200, informedMultiPlex < 200), aes(x = rank_HP, y = informedMultiPlex, label = ifelse(GeneName %in% selected_genes, GeneName, ""))) + geom_point(col = "grey20") + ggrepel::geom_text_repel() +
  geom_abline(slope = 1, intercept = 0, col = "grey50", linetype = "dashed")  + xlab("Rank: HP count") + ylab("Rank: Informed Multiplex Propagation") + theme_cowplot()

#ggsave("../Figs/ranking_patient_top_compare_methods_genes_not_in_IDpanel_scattered.pdf", p, width = 4, height = 4)

p

Compare the performance between the full and Temporal holdout set

Show code
auc_vals_df <- tibble(Method  = names(auc_vals), AUC = unlist(auc_vals)) %>%
  filter(Method %in%  c("rank_inHP", "rank_popularity", "rank_expression_BrainAll", "rank_pathwaycounts", 
"informedMultiPlex")) %>%
  mutate(Method = factor(Method, levels = c("rank_inHP", "rank_popularity", "rank_expression_BrainAll", "rank_pathwaycounts", 
"informedMultiPlex")
, labels = str_remove_all(selected_features_label, "Reference: |Feature: ")),
   #     col = palettes,
#   Method = as.character(Method),
  testList = "Full"
)

auc_vals_nonPanel_df <- tibble(Method  = names(auc_vals_nonPanel), AUC = unlist(auc_vals_nonPanel)) %>%
  filter(Method %in% selected_features) %>%
  mutate(Method = factor(Method, levels = selected_features, labels = str_remove_all(selected_features_label, "Reference: |Feature: ")),
   #     col = palettes,
         Method = as.character(Method, -AUC),
   testList = "-Panel") %>%
  arrange(AUC)



auc_vals_df_combn <- rbind(auc_vals_df, auc_vals_nonPanel_df) #%>%
 # mutate(Method = factor(Method, levels = selected_features_label))#,
         # Method = fct_reorder(Method, -(AUC)))

p <- ggplot(auc_vals_df_combn, aes(x = testList, y = AUC, fill = Method)) + geom_col() + 
  facet_grid(.~ Method)  + 
  coord_cartesian(ylim = c(0.5, 1)) + 
  scale_fill_manual(values = palettes) + theme_cowplot() + xlab("Patient list") 

#ggsave("../Figs/barplot_compare_AUC_full_vs_nonPanel_patientList.pdf", height = 3, width = 8)

p

Session Information

Show code
pacman::p_load(report)
report_session <- report(sessionInfo())
write_lines(report_session, "report_session.md")
print(report_session)
Analyses were conducted using the R Statistical language (version 3.6.3; R Core Team, 2020) on macOS  10.16, using the packages voronoiTreemap (version 0.2.1; Alexander Kowarik et al., 2021), cowplot (version 1.1.1; Claus Wilke, 2020), ggsignif (version 0.6.1; Constantin Ahlmann-Eltze and Indrajeet Patil, 2021), igraph (version 1.2.6; Csardi G, Nepusz T: The igraph software package for complex network research, InterJournal, Complex Systems 1695. 2006. https://igraph.org), hexbin (version 1.28.2; Dan Carr et al., 2021), RColorBrewer (version 1.1.2; Erich Neuwirth, 2014), S4Vectors (version 0.24.4; Pagès, Lawrence and Aboyoun, 2020), ggplot2 (version 3.3.3; Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.), reshape2 (version 1.4.4; Hadley Wickham, 2007), stringr (version 1.4.0; Hadley Wickham, 2019), tidyr (version 1.1.2; Hadley Wickham, 2020), forcats (version 0.5.1; Hadley Wickham, 2021), scales (version 1.1.1; Hadley Wickham and Dana Seidel, 2020), readr (version 1.4.0; Hadley Wickham and Jim Hester, 2020), dplyr (version 1.0.4; Hadley Wickham et al., 2021), AnnotationDbi (version 1.48.0; Hervé Pagès et al., 2019), ggforestplot (version 0.1.0; Ilari Scheinin et al., 2021), rmarkdown (version 2.7; JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone, 2021), ggrepel (version 0.9.1; Kamil Slowikowski, 2021), tibble (version 3.1.0; Kirill Müller and Hadley Wickham, 2021), IRanges (version 2.20.2; Lawrence M et al., 2013), purrr (version 0.3.4; Lionel Henry and Hadley Wickham, 2020), report (version 0.3.0; Makowski et al., 2020), org.Hs.eg.db (version 3.10.0; Marc Carlson, 2019), treemap (version 2.4.2; Martijn Tennekes, 2017), Biobase (version 2.46.0; Orchestrating high-throughput genomic analysis with Bioconductor. Huber, V.Carey, Gentleman, ..., Morgan Nature Methods, 2015:12, 115.), BiocGenerics (version 0.32.0; Orchestrating high-throughput genomic analysis with Bioconductor. Huber, V.Carey, Gentleman, ..., Morgan Nature Methods, 2015:12, 115.), ggstatsplot (version 0.7.0; Patil, 2018), pacman (version 0.5.1; Rinker et al., 2017), ggforce (version 0.3.2; Thomas Lin Pedersen, 2020), ggraph (version 2.0.4; Thomas Lin Pedersen, 2020), patchwork (version 1.1.1; Thomas Lin Pedersen, 2020), tidygraph (version 1.2.0; Thomas Lin Pedersen, 2020), MASS (version 7.3.53; Venables et al., 2002), tidyverse (version 1.3.0; Wickham et al., 2019), pROC (version 1.17.0.1; Xavier Robin et al., 2011) and knitr (version 1.31; Yihui Xie, 2021).

References
----------
  - Alexander Kowarik, Bernhard Meindl, Malaver Vojvodic, Mike Bostock and Franck Lebeau (2021). voronoiTreemap: Voronoi Treemaps with Added Interactivity by Shiny. R package version 0.2.1. https://github.com/uRosConf/voronoiTreemap
  - Claus O. Wilke (2020). cowplot: Streamlined Plot Theme and Plot Annotations for 'ggplot2'. R package version 1.1.1. https://CRAN.R-project.org/package=cowplot
  - Constantin Ahlmann-Eltze and Indrajeet Patil (2021). ggsignif: Significance Brackets for 'ggplot2'. R package version 0.6.1. https://CRAN.R-project.org/package=ggsignif
  - Csardi G, Nepusz T: The igraph software package for complex network research, InterJournal, Complex Systems 1695. 2006. https://igraph.org
  - Dan Carr, ported by Nicholas Lewin-Koh, Martin Maechler and contains copies of lattice functions written by Deepayan Sarkar (2021). hexbin: Hexagonal Binning Routines. R package version 1.28.2. https://CRAN.R-project.org/package=hexbin
  - Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. https://CRAN.R-project.org/package=RColorBrewer
  - H. Pagès, M. Lawrence and P. Aboyoun (2020). S4Vectors: Foundation of vector-like and list-like containers in Bioconductor. R package version 0.24.4.
  - H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
  - Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.
  - Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr
  - Hadley Wickham (2020). tidyr: Tidy Messy Data. R package version 1.1.2. https://CRAN.R-project.org/package=tidyr
  - Hadley Wickham (2021). forcats: Tools for Working with Categorical Variables (Factors). R package version 0.5.1. https://CRAN.R-project.org/package=forcats
  - Hadley Wickham and Dana Seidel (2020). scales: Scale Functions for Visualization. R package version 1.1.1. https://CRAN.R-project.org/package=scales
  - Hadley Wickham and Jim Hester (2020). readr: Read Rectangular Text Data. R package version 1.4.0. https://CRAN.R-project.org/package=readr
  - Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.4. https://CRAN.R-project.org/package=dplyr
  - Hervé Pagès, Marc Carlson, Seth Falcon and Nianhua Li (2019). AnnotationDbi: Manipulation of SQLite-based annotations in Bioconductor. R package version 1.48.0.
  - Ilari Scheinin, Maria Kalimeri, Vilma Jagerroos, Juuso Parkkinen, Emmi Tikkanen, Peter Würtz and Antti Kangas (2021). ggforestplot: Forestplots of Measures of Effects and Their Confidence Intervals. https://nightingalehealth.github.io/ggforestplot/index.html, https://github.com/nightingalehealth/ggforestplot.
  - JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone (2021). rmarkdown: Dynamic Documents for R. R package version 2.7. URL https://rmarkdown.rstudio.com.
  - Kamil Slowikowski (2021). ggrepel: Automatically Position Non-Overlapping Text Labels with 'ggplot2'. R package version 0.9.1. https://CRAN.R-project.org/package=ggrepel
  - Kirill Müller and Hadley Wickham (2021). tibble: Simple Data Frames. R package version 3.1.0. https://CRAN.R-project.org/package=tibble
  - Lawrence M, Huber W, Pag\`es H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118. doi:10.1371/journal.pcbi.1003118
  - Lionel Henry and Hadley Wickham (2020). purrr: Functional Programming Tools. R package version 0.3.4. https://CRAN.R-project.org/package=purrr
  - Makowski, D., Ben-Shachar, M.S., Patil, I. & Lüdecke, D. (2020). Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption. CRAN. Available from https://github.com/easystats/report. doi: .
  - Marc Carlson (2019). org.Hs.eg.db: Genome wide annotation for Human. R package version 3.10.0.
  - Martijn Tennekes (2017). treemap: Treemap Visualization. R package version 2.4-2. https://CRAN.R-project.org/package=treemap
  - Orchestrating high-throughput genomic analysis with Bioconductor. W. Huber, V.J. Carey, R. Gentleman, ..., M. Morgan Nature Methods, 2015:12, 115.
  - Orchestrating high-throughput genomic analysis with Bioconductor. W. Huber, V.J. Carey, R. Gentleman, ..., M. Morgan Nature Methods, 2015:12, 115.
  - Patil, I. (2018). ggstatsplot: 'ggplot2' Based Plots with Statistical Details. CRAN. Retrieved from https://cran.r-project.org/web/packages/ggstatsplot/index.html
  - R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
  - Rinker, T. W. & Kurkiewicz, D. (2017). pacman: Package Management for R. version 0.5.0. Buffalo, New York. http://github.com/trinker/pacman
  - Thomas Lin Pedersen (2020). ggforce: Accelerating 'ggplot2'. R package version 0.3.2. https://CRAN.R-project.org/package=ggforce
  - Thomas Lin Pedersen (2020). ggraph: An Implementation of Grammar of Graphics for Graphs and Networks. R package version 2.0.4. https://CRAN.R-project.org/package=ggraph
  - Thomas Lin Pedersen (2020). patchwork: The Composer of Plots. R package version 1.1.1. https://CRAN.R-project.org/package=patchwork
  - Thomas Lin Pedersen (2020). tidygraph: A Tidy API for Graph Manipulation. R package version 1.2.0. https://CRAN.R-project.org/package=tidygraph
  - Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
  - Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
  - Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, p. 77. DOI: 10.1186/1471-2105-12-77 <http://www.biomedcentral.com/1471-2105/12/77/>
  - Yihui Xie (2021). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.31.